Best AI Video Creation Tools

Best AI Video Creation Tools — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • BioBIKE

    BioBIKE

    BioBike(nee. BioLingua ) is a cloud-based, through-the-web programmable (Paas) symbolic biocomputing and bioinformatics platform that aims to make computational biology, and especially intelligent biocomputing (that is, the application of Artificial Intelligence to computational biology) accessible to research scientists who are not expert programmers. == Unique capabilities == BioBIKE is an integrated symbolic biocomputing and bioinformatics platform, built from the start as an entirely (what is now called) cloud-based architecture where all computing is done in remote servers, and all user access is accomplished through web browsers. BioBIKE has a built-in frame system in which all objects, data, and knowledge are represented. This enables code written either in the native Lisp, in the visual programming language, or systems of rules expressed in the SNARK theorem prover to access the whole of biological knowledge in an integrated manner. For its time (released in 2002) it was unique in permitting users to create fully functional biocomputing programs that run on the back-end servers entirely through the web browser UI. (In modern terms it was one of the first PaaS (Platform as a Service) systems, predating even Salesforce in this capability.) Initially this programming was carried out in raw Lisp, but Jeff Elhai's team at VCU, with NSF funding, created an entirely graphical programming environment on top of BioBIKE based upon the Boxer-style programming environments. Being a multi-headed, multi-threaded, multi-user, multi-tenancy cloud-based system, BioBIKE users were able to directly work together through their web browsers, remotely sharing the same listener and memory space. This permitted a unique sort of collaboration, discussed in Shrager (2007). A specialized offshoot of BioBIKE called "BioDeducta" includes SRI's SNARK theorem prover, offering unique "deductive biocomputing" capabilities. == Implementation == BioBIKE is open-source software implemented using the Lisp programming language. Continuing development takes place by the BioBIKE team centered at Virginia Commonwealth University . == History == BioBIKE was originally called "BioLingua", and was developed by Jeff Shrager at The Carnegie Inst. of Washington Dept. of Plant Biology, and JP Massar with funding from NASA's Astrobiology Division. Shrager and Massar wanted to create a web-based, multi-user Lisp Machine, specialized for bioinformatics. Other early contributors to the project included Mike Travers, and Jeff Elhai of VCU. Elhai obtained continuing funding from the National Science Foundation for the project, which was renamed BioBIKE. Elhai and colleagues added BioBIKE's unique visual programming language. Shrager, meanwhile, collaborated with Richard Waldinger at SRI to build SRI's (SNARK) theorem prover into BioBIKE, creating a deductive biocomputing system, called BioDeducta. == Instances == There used to be a number of BioBIKE verticals in different biological domains, including viral pathogens, cyanobacteria and other bacteria, Arabidopsis thaliana, and several others described in the references.

    Read more →
  • Maia and Marco

    Maia and Marco

    Maia and Marco are artificial intelligence used by GMA Network. Unveiled in 2023, they are used to fulfill the role of sports newscasters. == Background == Maia and Marco are artificial intelligence (AI) which take the form of three-dimensional human avatars. Maia makes use of a female avatar while Marco uses a male likeness. They have aesthetic features that are typical to Filipino showbusiness personalities. Among the technologies used in making and operating the AI include image generation, text-to-speech AI voice synthesis/generation, and deep learning face animation. They are also demonstrated to be bilingual, being able to speak in English and Tagalog (Filipino). == Use == The AI pair was unveiled by GMA Network on September 24, 2023, for their coverage of Season 99 of the National Collegiate Athletic Association (NCAA). Fulfilling the role of sports newscasters, Maia and Marco would join GMA's courtside human reporters. The AI pair are scheduled to appear four times a month on GMA's digital media platforms. They will not appear in traditional television broadcast. == Reception == The launch of the Maia and Marco was met with strong reactions. Various journalists and other personalities across the Philippine media industry expressed concern that their employment be at risk with the introduction of AI. The quality of the AI ability to emulate human behavior was characterized by critics as "soulless". GMA responding to concerns has stated that the AI would complement rather than replace its live human journalists including sportscasters. The National Union of Journalists of the Philippines urged dialogue among its peers in the newsroom on policy on how to use AI, which the group acknowledge as "inevitable".

    Read more →
  • Radiant AI

    Radiant AI

    The Radiant AI is a technology developed by Bethesda Softworks for The Elder Scrolls video games. It allows non-player characters (NPCs) to make choices and engage in behaviors more complex than in past titles. The technology was developed for The Elder Scrolls IV: Oblivion and expanded in The Elder Scrolls V: Skyrim; it is also used in Fallout 3, Fallout: New Vegas and Fallout 4, also published by Bethesda, with 3 and 4 being developed by them as well. == Technology == The Radiant AI technology, as it evolved in its iteration developed for Skyrim, comprises two parts: === Radiant AI === The Radiant AI system deals with NPC interactions and behavior. It allows non-player characters to dynamically react to and interact with the world around them. General goals, such as "Eat in this location at 2pm" are given to NPCs, and NPCs are left to determine how to achieve them. The absence of individual scripting for each character allows for the construction of a world on a much larger scale than other games had developed, and aids in the creation of what Todd Howard described as an "organic feel" for the game. === Radiant Story === The Radiant Story system deals with how the game itself reacts to the player behavior, such as the creation of new dynamic quests. Dynamically generated quests are placed by the game in locations the player hasn't visited yet and are related to earlier adventures.

    Read more →
  • AlphaEvolve

    AlphaEvolve

    AlphaEvolve is an evolutionary coding agent for designing advanced algorithms based on large language models such as Gemini. It was developed by Google DeepMind and unveiled in May 2025. == Design == AlphaEvolve aims to autonomously discover and refine algorithms through a combination of large language models (LLMs) and evolutionary computation. AlphaEvolve needs an evaluation function with metrics to optimize, and an initial algorithm. At each step, AlphaEvolve uses the LLM to produce variants of the existing algorithms, and then selects the most effective ones. Unlike domain-specific predecessors like AlphaFold or AlphaTensor, AlphaEvolve is designed as a general-purpose system. It can operate across a wide array of scientific and engineering tasks by automatically modifying code and optimizing for multiple objectives. Its architecture allows it to evaluate code programmatically, reducing reliance on human input and mitigating risks such as hallucinations common in standard LLM outputs. == Achievements == According to Google, across a selection of 50 open mathematical problems, the model was able to rediscover state-of-the-art solutions 75% of the time and discovered improved solutions 20% of the time, for example advancing the kissing number problem. AlphaEvolve was also used to optimize Google's computing ecosystem. Improved data center scheduling heuristics, enabled the recovery of 0.7% of stranded resources. It was also used to optimize TPU circuit design and Gemini's training matrix multiplication kernel. == Open source implementations == Following the publication of AlphaEvolve, several open source implementations have been developed by the research community. One such implementation is OpenEvolve, which implements distributed evolutionary algorithms, multi-language support, integration with various large language model providers, and automated discovery of high-performance GPU kernels that outperform expert-engineered baselines.

    Read more →
  • Aarogya Setu

    Aarogya Setu

    Aarogya Setu (lit. 'The bridge to health') is an Indian COVID-19 "contact tracing, syndromic mapping and self-assessment" digital service, primarily a mobile app, developed by the National Informatics Centre under the Ministry of Electronics and Information Technology (MeitY). The app reached more than 100 million installs in 40 days. On 26 May, amid growing privacy and security concerns, the source code of the app was made public. == Full view == The stated purpose of this app is to spread awareness of COVID-19 and to connect essential COVID-19-related health services to the people of India. This app augments the initiatives of the Department of Health to contain COVID-19 and shares best practices and advisories. It is a tracking app which uses the smartphone's GPS and Bluetooth features to track COVID-19 cases. The app is available for Android and iOS mobile operating systems. With Bluetooth, it tries to determine the risk if one has been near (within six feet of) a COVID-19-infected person, by scanning through a database of known cases across India. Using location information, it determines whether the location one is in belongs to one of the infected areas based on the data available. This app is an updated version of an earlier app called Corona Kavach (now discontinued) which was released earlier by the Government of India. == Features and tools == Aarogya Setu has four sections: User Status (tells the risk of getting COVID-19 for the user) Self Assess (helps the users identify COVID-19 symptoms and their risk profile) COVID-19 Updates (gives updates on local and national COVID-19 cases) E-pass integration (if applied for E-pass, it will be available) See Recent Contacts option (allows the users to assess the risk level of their Bluetooth contacts) It tells how many COVID-19 positive cases are likely in a radius of 500 m, 1 km, 2 km, 5 km and 10 km from the user. The app is built on a platform that can provide an application programming interface (API) so that other computer programs, mobile applications, and web services can make use of the features and data available in Aarogya Setu. == Response == Aarogya Setu crossed five million downloads within three days of its launch, making it one of the most popular government apps in India. It became the world's fastest-growing mobile app, beating Pokémon Go, with more than 50 million installs 13 days after launching in India on 2 April 2020. It reached 100 million installs by 13 May 2020, that is in 40 days since its launch. In an order on 29 April 2020 the central government made it mandatory for all employees to download the app and use it – "Before starting for office, they must review their status on Aarogya Setu and commute only when the app shows safe or low risk". The Union Home Ministry also said that the application is mandatory for all living in the COVID-19 containment zone. The government gave the announcement along with the nationwide lockdown extension by two weeks from the 4 May with certain relaxations. On 21 May 2020, the Airport Authority of India issued a Standard Operating Procedure (SOP) stating that all departing passengers must compulsorily be registered with the Aarogya Setu app. It added that the app would not be mandatory for children below 14 years. However, the next day, Civil Aviation Minister Hardeep Singh Puri clarified that the app would not be mandatory for any passengers. On 26 May 2020, the Aarogya Setu app code was made open to developers across the globe to help other countries manage contact tracing in their fight against COVID-19 pandemic. In March 2021, Co-WIN portal was integrated with the app. This allowed users to schedule an appointment through the app for COVID-19 vaccine by registering their phone number and providing relevant documents. == Effectiveness == NITI Aayog CEO revealed that "the app has been able to identify more than 3,000 hotspots in 3–17 days ahead of time." However, users and experts in India and around the world say the app raises huge data security concerns. The app collects name, number, gender, travel history, and uses a phone's Bluetooth and location data to let users know if they have been near a person with COVID-19 by scanning a database of known cases of infection, and also share it with the government simultaneously. This is the major area of concern as the app's constant access to a phone's Bluetooth imposes a form of security threat. But it stood to clarify itself that the informations received are not going to be made public. Amidst all these, the app hits a record of about one-hundred million downloads. == Reception == Rahul Gandhi, leader of the Congress party, termed the Aarogya Setu application a "sophisticated surveillance system" after the government announced that downloading the app would be mandatory for both government and private employees. Following this, others raised the same concerns about the Aarogya Setu app. The Ministry of Electronics and Information Technology (MeitY) responded to these concerns by asserting that Gandhi's claims were false, and that the app was being appreciated internationally. On 5 May, French ethical hacker Robert Baptiste, who goes by the name Elliot Alderson on Twitter, claimed that there were security issues with the app. The Indian government, as well as the app developers, responded to this claim by thanking the hacker for his attention, but dismissed his concerns. The developers of the app stated that the fetching of location data is a documented feature of the app, rather than a flaw, since the app is designed to track the distribution of the virus-infected population. They also asserted that no personal information of any user has been proven to be at risk. On 6 May, Robert Baptiste tweeted that security vulnerabilities in Aarogya Setu allowed hackers to "know who is infected, unwell, [or] made a self assessment in the area of his choice". He also gave details of how many people were unwell and infected at the Prime Minister's Office, the Indian Parliament and the Home Office. The Economic Times pointed out that a clause in the app's Terms and Conditions stated that the user "agrees and acknowledges that the Government of India will not be liable for ... any unauthorised access to your information or modification thereof". In response, several software developers called for the source code to be made public. On 12 May, former Supreme Court Judge Justice B.N. Srikrishna termed the government's push mandating the use of Aarogya Setu app "utterly illegal". He said so far it is not backed by any law and questioned "under what law, government is mandating it on anyone". MIT Technology Review gave 2 out of 5 stars to Aarogya Setu app after analyzing the COVID contact tracing apps launched in 25 countries. The app got stars only for the policy which suggests that data collected is deleted after a period of time and that the data collection, as far as user inputs go, is minimal. It also highlighted that India is the only democracy making its app mandatory for millions of people. The rating was further downgraded from 2 to 1 for collecting more information than the app needs to function. Following this, the MeitY made the source code of the Android app public on GitHub on 26 May, which will be followed by iOS and API documentation. Further, the Government has also launched a "bug bounty program". This was done to "promote transparency and ensure security and integrity of the app". However, experts stated that the server-side code had not yet been publicly released, which meant that public opinion on security and privacy was yet to be completely assuaged. Following this, ZDNet noted that the source code seemed to confirm the government's claim that user location data, if collected, would be anonymised and would be deleted after 45 days, or 60 days for high-risk individuals.

    Read more →
  • Feigenbaum test

    Feigenbaum test

    A Feigenbaum test is a variation of the Turing test where a computer system attempts to replicate an expert in a given field such as chemistry or marketing. It is also known, as a subject matter expert Turing test and was proposed by Edward Feigenbaum in a 2003 paper. The concept is also described by Ray Kurzweil in his 2005 book The Singularity is Near. Kurzweil argues that machines who pass this test are an inevitable consequence of Moore's Law.

    Read more →
  • Repertory grid

    Repertory grid

    The repertory grid is an interviewing technique which uses nonparametric factor analysis to determine an idiographic measure of personality. It was devised by George Kelly in around 1955 and is based on his personal construct theory of personality. == Introduction == The repertory grid is a technique for identifying the ways that a person construes (interprets or gives meaning to) his or her experience. It provides information from which inferences about personality can be made, but it is not a personality test in the conventional sense. It is underpinned by the personal construct theory developed by George Kelly, first published in 1955. A grid consists of four parts: A topic: it is about some part of the person's experience. A set of elements, which are examples or instances of the topic. Working as a clinical psychologist, Kelly was interested in how his clients construed people in the roles they adopted towards the client, and so, originally, such terms as "my father", "my mother", "an admired friend" and so forth were used. Since then, the grid has been used in much wider settings (educational, occupational, organisational) and so any well-defined set of words, phrases, or even brief behavioral vignettes can be used as elements. For example, to see how a person construes the purchase of a car, a list of vehicles within that person's price range could be a set of elements. A set of constructs. These are the basic terms that the client uses to make sense of the elements, and are always expressed as a contrast. Thus the meaning of "good" depends on whether you intend to say "good versus poor", as if you were construing a theatrical performance, or "good versus evil", as if you were construing the moral or ontological status of some more fundamental experience. A set of ratings of elements on constructs. Each element is positioned between the two extremes of the construct using a 5- or 7-point rating scale system; this is done repeatedly for all the constructs that apply; and thus its meaning to the client is modeled, and statistical analysis varying from simple counting, to more complex multivariate analysis of meaning, is made possible. Constructs are regarded as personal to the client, who is psychologically similar to other people depending on the extent to which they would tend to use similar constructs, and similar ratings, in relating to a particular set of elements. The client is asked to consider the elements three at a time, and to identify a way in which two of the elements might be seen as alike, but distinct from, contrasted to, the third. For example, in considering a set of people as part of a topic dealing with personal relationships, a client might say that the element "my father" and the element "my boss" are similar because they are both fairly tense individuals, whereas the element "my wife" is different because she is "relaxed". And so we identify one construct that the individual uses when thinking about people: whether they are "tense as distinct from relaxed". In practice, good grid interview technique would delve a little deeper and identify some more behaviorally explicit description of "tense versus relaxed". All the elements are rated on the construct, further triads of elements are compared and further constructs elicited, and the interview would continue until no further constructs are obtained. == Using the repertory grid == Careful interviewing to identify what the individual means by the words initially proposed, using a 5-point rating system could be used to characterize the way in which a group of fellow-employees are viewed on the construct "keen and committed versus energies elsewhere", a 1 indicating that the left pole of the construct applies ("keen and committed") and a 5 indicating that the right pole of the construct applies ("energies elsewhere"). On being asked to rate all of the elements, our interviewee might reply that Tom merits a 2 (fairly keen and committed), Mary a 1 (very keen and committed), and Peter a 5 (his energies are very much outside the place of employment). The remaining elements (another five people, for example) are then rated on this construct. Typically (and depending on the topic) people have a limited number of genuinely different constructs for any one topic: 6 to 16 are common when they talk about their job or their occupation, for example. The richness of people's meaning structures comes from the many different ways in which a limited number of constructs can be applied to individual elements. A person may indicate that Tom is fairly keen, very experienced, lacks social skills, is a good technical supervisor, can be trusted to follow complex instructions accurately, has no sense of humour, will always return a favour but only sometimes help his co-workers, while Mary is very keen, fairly experienced, has good social and technical supervisory skills, needs complex instructions explained to her, appreciates a joke, always returns favours, and is very helpful to her co-workers: these are two very different and complex pictures, using just 8 constructs about a person's co-workers. Important information can be obtained by including self-elements such as "Myself as I am now"; "Myself as I would like to be" among other elements, where the topic permits. == Analysis of results == A single grid can be analysed for both content (eyeball inspection) and structure (cluster analysis, principal component analysis, and a variety of structural indices relating to the complexity and range of the ratings being the chief techniques used). Sets of grids are dealt with using one or other of a variety of content analysis techniques. A range of associated techniques can be used to provide precise, operationally defined expressions of an interviewee's constructs, or a detailed expression of the interviewee's personal values, and all of these techniques are used in a collaborative way. The repertory grid is emphatically not a standardized "psychological test"; it is an exercise in the mutual negotiation of a person's meanings. The repertory grid has found favour among both academics and practitioners in a great variety of fields because it provides a way of describing people's construct systems (loosely, understanding people's perceptions) without prejudging the terms of reference—a kind of personalized grounded theory. Unlike a conventional rating-scale questionnaire, it is not the investigator but the interviewee who provides the constructs on which a topic is rated. Market researchers, trainers, teachers, guidance counsellors, new product developers, sports scientists, and knowledge capture specialists are among the users who find the technique (originally developed for use in clinical psychology) helpful. == Relationship to other tools == In the book Personal Construct Methodology, researchers Brian R. Gaines and Mildred L.G. Shaw noted that they "have also found concept mapping and semantic network tools to be complementary to repertory grid tools and generally use both in most studies" but that they "see less use of network representations in PCP [personal construct psychology] studies than is appropriate". They encouraged practitioners to use semantic network techniques in addition to the repertory grid.

    Read more →
  • Safe Superintelligence Inc.

    Safe Superintelligence Inc.

    Safe Superintelligence Inc. (SSI Inc.) is an Israeli-American artificial intelligence company founded by Ilya Sutskever, the former chief scientist of OpenAI; Daniel Gross, former head of Apple’s AI efforts; and Daniel Levy, an investor and AI researcher. The company's mission is to focus on safely developing a superintelligence, a computer-based agent capable of surpassing human intelligence. == History == On May 15, 2024, OpenAI co-founder Ilya Sutskever left OpenAI after a board dispute where he voted to fire Sam Altman amid concerns about communication and trust. Sutskever and others additionally believed that OpenAI was neglecting its original focus on safety in favor of pursuing opportunities for commercialization. On June 19, 2024, Sutskever posted on X that he was starting SSI Inc, with the goal to safely develop superintelligent AI, alongside Daniel Levy, and Daniel Gross. The company, composed of a small team, is split between Palo Alto, California and Tel Aviv, Israel. In September 2024, SSI revealed it had raised $1 billion from venture capital firms including SV Angel, DST Global, Sequoia Capital, and Andreessen Horowitz. The money will be used to build up more computing power and hire top individuals in the field. In March 2025, SSI reached a $30 billion valuation in a funding round led by Greenoaks Capital. This is six times its previous $5 billion valuation from September 2024. Despite not yet generating revenue and having approximately 20 employees, the company has attracted significant investor interest, largely due to co-founder Ilya Sutskever's reputation and its focus on developing safe superintelligence. In April 2025, Google Cloud announced a partnership to provide TPUs for SSI's research. In the first half of 2025, Meta attempted to acquire SSI but was rebuffed by Sutskever. In July 2025, co-founder Gross left the company to join Meta Superintelligence Labs, and Sutskever became the CEO of SSI.

    Read more →
  • Human–AI interaction

    Human–AI interaction

    Human–AI interaction is a developing field of research and a sub-field of human–computer interaction (HCI). HCI is a field of research that explores the interactions between humans and computer-based technology, focusing on design implementation, user experience, and psychological factors. With the proliferation of artificial intelligence (AI), there has developed a sub-section of HCI research dedicated specifically to artificial intelligence and how people interact with and are impacted by it. This is human–AI interaction, abbreviated either as HAX or HAII. == Introduction == Artificial intelligence (AI), in general, has fluid definitions and varied research applications, but in brief can be applied to mechanizing tasks that would require human intelligence to complete. AI are tools designed to replicate the human abilities of navigating uncertainty, active learning, and processing information in different contexts. Within the context of HCI and HAX research, artificial intelligence can be broken into two sub-fields, natural language processing (NLP) and computer vision (CV). AI technologies notably include machine-learning, deep-learning and neural networks, and large-language models (LLMs). As a new and rapidly developing technology, AI is changing how computers work and therefore changing how humans interact with computers. Unlike the traditional human-computer interaction, where a human directs a machine, human-AI interaction is characterized by a more collaborative relationship between the computer program (the AI) and the human user, as AI is perceived as an active agent rather than a tool. This changing dynamic creates new questions and necessitates new research methods that are not present in traditional HCI research. According to a scoping review on the state of the discipline, the HAX field comprises research on the "design, development, and evaluation of AI systems" and encompasses the themes of human-AI collaboration, human-AI competition, human-AI conflict, and human-AI symbiosis. == Design == Machine learning and artificial intelligence have been used for decades in targeted advertising and to recommend content in social media. Ethical Guidelines (Framework for ethical AI development) == User Experience (UX) == This section should handle research on how users interact with tools. What techniques do they use, do they develop habits, what types of programs and devices are they using to access these tools, what do they use these tools to do exactly. === Cognitive Frameworks in AI Tool Users === AI has been viewed with various expectations, attributions, and often misconceptions. Many people exclusively understand AI as the LLM chatbots they interact with, like ChatGPT or Claude, or other generative AI programs. [Insert section: discuss how people interact with these specific AI tools as a connection to the following paragraphs] Most fundamentally, humans have a mental model of understanding AI's reasoning and motivation for its decision recommendations, and building a holistic and precise mental model of AI helps people create prompts to receive more valuable responses from AI. However, these mental models are not whole because people can only gain more information about AI through their limited interaction with it; more interaction with AI builds a better mental model that a person may build to produce better prompt outcomes. Research on human-AI interaction has emphasized that users develop mental models of AI systems and revise those models through repeated use, feedback, and explanation, while design research has stressed the importance of communicating capabilities and limitations early and supporting trust calibration through explanation and correction. In a 2025 SSRN working paper, John DeVadoss proposed "Hypothetico-Deductive Interaction" (HDI), a framework that describes human-AI interaction as a mutual process of conjecture and refutation in which users test assumptions about an AI system's capabilities while the system infers and updates assumptions about user goals through its responses and clarifying questions. DeVadoss argued that this framing helps explain prompt iteration, weak capability awareness, and trust miscalibration, and suggested design responses such as clearer communication of uncertainty, easier correction, actionable explanations, and safer failure modes. == Research themes == === Human-AI collaboration === Human-AI collaboration occurs when the human and AI supervise the task on the same level and extent to achieve the same goal. Some collaboration occurs in the form of augmenting human capability. AI may help human ability in analysis and decision-making through providing and weighing a volume of information, and learning to defer to the human decision when it recognizes its unreliability. It is especially beneficial when the human can detect a task that AI can be trusted to make few errors so that there is not a lot of excessive checking process required on the human's end. Some findings show signs of human-AI augmentation, or human–AI symbiosis, in which AI enhances human ability in a way that co-working on a task with AI produces better outcomes than a human working alone. For example: the quality and speed of customer service tasks increase when a human agent collaborates with AI, training on specific models allows AI to improve diagnoses in clinical settings, and AI with human-intervention can improve creativity of artwork while fully AI-generated haikus were rated negatively. Human-AI synergy, a concept in which human-AI collaboration would produce more optimal outcomes than either human or AI working alone could explain why AI does not always help with performance. Some AI features and development may accelerate human-AI synergy, while others may stagnate it. For example, when AI updates for better performance, it sometimes worsens the team performance with human and AI by reducing the compatibility with the new model and the mental model a user has developed on the previous version. Research has found that AI often supports human capabilities in the form of human-AI augmentation and not human-AI synergy, potentially because people rely too much on AI and stop thinking on their own. Prompting people to actively engage in analysis and think when to follow AI recommendations reduces their over-reliance, especially for individuals with higher need for cognition. === Human-AI competition === Robots and computers have substituted routine tasks historically completed by humans, but agentic AI has made it possible to also replace cognitive tasks including taking phone calls for appointments and driving a car. At the point of 2016, research has estimated that 45% of paid activities could be replaced by AI by 2030. Perceived autonomy of robots is known to increase people's negative attitude toward them, and worry about the technology taking over leads people to reject it. There has been a consistent tendency of algorithm aversion in which people prefer human advice over AI advice. However, people are not always able to tell apart tasks completed by AI or other humans. See AI takeover for more information. It is also notable that this sentiment is more prominent in the Western cultures as Westerners tend to show less positive views about AI compared to East Asians. == Research on the psychological impacts of AI == === Perception on others who use AI === As much as people perceive and make judgment about AI itself, they also form impressions of themselves and others who use AI. In the workplace, employees who disclose the use of AI in their tasks are more likely to receive feedback that they are not as hardworking as those who are in the same job who receive non-AI help to complete the same tasks. AI use disclosure diminishes the perceived legitimacy in the employee's task and decision making which ultimately leads observers to distrust people who use AI. Although these negative effects of AI use disclosure are weakened by the observers who use AI frequently themselves, the effect is still not attenuated by the observers' positive attitude towards AI. === Bias, AI, and human === Although AI provides a wide range of information and suggestions to its users, AI itself is not free of biases and stereotypes, and it does not always help people reduce their cognitive errors and biases. People are prone to such errors by failing to see other potential ideas and cases that are not listed by AI responses and committing to a decision suggested by AI that directly contradicts the correct information and directions that they are already aware of. Gender bias is also reflected as the female gendering of AI technologies which conceptualizes females as a helpful assistant. == Emotional connection with AI == Human-AI interaction has been theorized in the context of interpersonal relationships mainly in social psychology, communications and media studies, and as a technology interface through the lens of hu

    Read more →
  • User profile

    User profile

    A user profile is a collection of settings and information associated with a user. It contains critical information that is used to identify an individual, such as their name, age, portrait photograph and individual characteristics such as knowledge or expertise. User profiles are most commonly present on social media websites such as Facebook, Instagram, and LinkedIn; and serve as voluntary digital identity of an individual, highlighting their key features and traits. In personal computing and operating systems, user profiles serve to categorise files, settings, and documents by individual user environments, known as 'accounts', allowing the operating system to be more friendly and catered to the user. Physical user profiles serve as identity documents such as passports, driving licenses and legal documents that are used to identify an individual under the legal system. A user profile can also be considered as the computer representation of a user model. A user model is a (data) structure that is used to capture certain characteristics about an individual user, and the process of obtaining the user profile is called user modeling or profiling. == Origin == The origin of user profiles can be traced to the origin of the passport, an identity document (ID) made mandatory in 1920, after World War I following negotiations at the League of Nations. The passport served as an official government record of an individual. Consequently, Immigration Act of 1924 was established to identify an individual's country of origin. In the 21st century, passports have now become a highly sought-after commodity as it is widely accepted as a source of verifying an individual's identity under the legal system. With the advent of digital revolution and social media websites, user profiles have transitioned to an organised group of data describing the interaction between a user and a system. Social media sites like Instagram allow individuals to create profiles that are representative of their desired personality and image. Filling all fields of profile information may not be necessary to create a meaningful self-presentation, which grants individual more control over of the identity they wish to present by displaying the most meaningful attributes. A personal user profile is a key aspect of an individual's social networking experience, around which his/her public identity is built. == Types of user profiles == A user profile can be of any format if it contains information, settings and/or characteristics specific to an individual. Most popular user profiles include those on photo and video sharing websites such as Facebook and Instagram, accounts on operating systems, such as those on Windows and MacOS and physical documents such as passports and driving licenses. === Social media === Effectively structured user profiles on social media channels such as Instagram and Facebook offer a way for people to form impressions about someone that is predictive or similarly meeting them offline. The condensed format of social media profiles allows for quick filtering of millions of profiles by matching individuals by similar characteristics and interests; information provided upon sign up. A research conducted highlights that only a "thin slice" of information is required to form an impression about an individual online (Stecher and Counts 2008). Online user profiles eliminate the complexity of interaction that is present in 'face-to-face' meetings such as behavioural, facial, and environmental information, resulting in increased predictiveness of user personality. Dating apps and websites solely rely on an individual's user profile and the information provided to form interactions and communication with others on the platform. Despite having control over presented information, lying is minimal in online dating contexts (Hancock, Toma and Ellison, 2007). Apps such as Bumble allow users to 'match' with other individuals based on their characteristics and selected filters that allow users to narrow the spectrum of search to their preference. Information for a user's profile is voluntarily specified by the user and includes information such as height, interests, photographs, gender or education. The requirement of information varies respective to each platform, and there surrounds little consensus to an appropriate amount of information for a condensed user profile. Universally, all social networking platforms display an individual's profile picture and an "about me" page that allows for self-expression. === Influencers === Influencer user profiles are third party endorsers who shape audience attitudes and decisions through social media content such as photos, blogs and tweets. Social Media Influencers (SMI) often hold a significant following on a social media platform which enables them to be recognised as opinion leaders to shape an information influence to their audience. 'Influencer marketing' industry gained prominence in 2018, when the photo sharing app Instagram crossed 1 billion users, subsequently with approximately 60,000 google search queries for 'influencer marketing' the same year. Influencer user profiles hold a unique selling point, or public personality that is unique and charismatic to the needs and wants of their target audience. SMI profiles advertise product information, latest promotions and regularly engage with their followers to maintain their online persona. Messages endorsed by social media influencers are often perceived as reliable and compelling, as a study conducted found 82% of followers were more inclined to follow the suggestions of their favorite influencer. This allows advertisers to leverage online user profiles and their audience rapport to target younger and niche audiences. According to a market survey, influencer marketing through social media profiles yields a return 11 times higher than traditional marketing, as they are more capable of communicating to a niche segment. Most popular influencers include sport starts such as Cristiano Ronaldo and Hollywood personalities such as Dwayne Johnson and Kylie Jenner each with over 200 million followers respectively. === Ecommerce === Online shopping or Ecommerce websites such as Amazon use information from a customer's user profile and interests to generate a list of recommended items to shop. Recommendation algorithms analyse user demographic data, history, and favourite artists to compile suggestions. The store rapidly adapts to changing user needs and preferences, with generation of real time results required within half of a second. New profiles naturally have limited information for algorithms to analyse, and customer data of each interaction provides valuable information which is stored as a database linked with each individual profile. User profiles on ecommerce websites also serve to improve sales of sellers as individuals are recommend products that other "customers who bought this item also bought" to widen the selection of the buyer. A study conducted found that user profiles and recommendation algorithms have significant impact on related product sales and overall spending of an individual. A process known as "collaborative filtering" tries to analyse common products of interest for an individual on the basis of views expressed by other similar behaving profiles. Features such as product ratings, seller ratings and comments allow individual user profiles to contribute to recommendation algorithms, eliminate adverse selection and contribute to shaping an online marketplace adhering to Amazons zero tolerance policy for misleading products. == Digital user profiles == Modern software and applications account for user profiles as a foundation on which a usable application is built. The structure and layout of an application such as its menus, features and controls are often derived from user's selected settings and preferences. The origin of digital user profiles in computer systems was first initiated by Windows NT that held user settings and information in a separate environment variable named %USERPROFILE% and held the framework to a user's profile root. Consequently, operating systems such as MacOS further accelerated prominence of user profiles in Mac OS X 10.0. Iterations since have been made with each operating system release with the aim to maximise user friendliness with the system. Features such as keyboard layouts, time zones, measurement units, synchronisation of different services and privacy preferences are made available during the setup of a user account on the computer === Types of accounts === ==== Administrator ==== Administrator user profiles have complete access to the system and its permissions. It is often the first user profile on a system by design, and is what allows other accounts to be created. However, since the administrator account has no restrictions, they are highly vulnerable to malware and viruses, with potential to impact all other accounts.

    Read more →
  • Fei-Fei Li

    Fei-Fei Li

    Fei-Fei Li (Chinese: 李飞飞; pinyin: Lǐ Fēifēi; born July 3, 1976) is a Chinese-born American computer scientist best known for establishing ImageNet, the dataset that enabled rapid advances in computer vision in the 2010s. She is a professor of computer science at Stanford University, with research expertise in artificial intelligence, machine learning, deep learning, computer vision, and cognitive neuroscience. Li is a co-director of the Stanford Institute for Human-Centered Artificial Intelligence and a co-director of the Stanford Vision and Learning Lab, and served as Chief Scientist of AI/ML at Google Cloud and the director of the Stanford Artificial Intelligence Laboratory from 2013 to 2018. In 2017, she co-founded AI4ALL, a nonprofit organization working to increase diversity in the field of artificial intelligence. In 2023, Li was named one of the Time 100 AI Most Influential People. Li received the Intel Lifetime Achievements Innovation Award in 2017 for her contributions to artificial intelligence, and was elected member of the National Academy of Engineering, the National Academy of Medicine in 2020 and the American Academy of Arts and Sciences in 2021. In 2025, she was named as one of the "Architects of AI" for Time's Person of the Year. On August 3, 2023, Li was appointed to the United Nations Scientific Advisory Board, established by Secretary-General Antonio Guterres. In 2024, Li was included on the Gold House's most influential Asian A100 list. In 2024, she raised $230 million for a startup called World Labs, which she and three colleagues founded to develop a "spatial intelligence" AI technology that can understand how the three-dimensional physical world works. In 2026, World Labs raised $1 Billion. == Early life and education == Li was born in Beijing, China, in 1976 and grew up in Chengdu, Sichuan. She studied at Sichuan Chengdu No.7 High School. When she was 12, her father immigrated to Parsippany, New Jersey. When she was 16, Li and her mother joined him in the United States. While attending Parsippany High School, Li worked weekends at her family's dry-cleaning shop. She graduated from Parsippany High School in 1995. She was inducted into the hall of fame at Parsippany High School in 2017. Li pursued undergraduate study at Princeton University, where she received a Bachelor of Arts with a major in physics in 1999. Li completed her senior thesis, "Auditory binaural correlogram difference: a new computational model for Huggins dichotic pitch", under the supervision of Bradley Dickinson, professor of electrical engineering. During her years at Princeton, Li returned home most weekends to help run her family's dry cleaning business and worked as a dishwasher to supplement the family income. Li pursued graduate study at the California Institute of Technology, where she received a Master of Science in electrical engineering in 2001 and a Doctor of Philosophy in electrical engineering in 2005. Li completed her dissertation, "Visual Recognition: Computational Models and Human Psychophysics", under the primary supervision of Pietro Perona and secondary supervision of Christof Koch. Her graduate studies were supported by the National Science Foundation Graduate Research Fellowship and The Paul & Daisy Soros Fellowships for New Americans. == Career and research == From 2005 to 2006, Li was an assistant professor in the Electrical and Computer Engineering Department at the University of Illinois Urbana-Champaign, and from 2007 to 2009, she was an assistant professor in the Computer Science Department at Princeton University. She joined Stanford in 2009 as an assistant professor, and was promoted to associate professor with tenure in 2012, and then full professor in 2018. At Stanford, Li served as the director of Stanford Artificial Intelligence Lab (SAIL) from 2013 to 2018. Her research has focused on computer vision, deep learning, and cognitive neuroscience, with over 300 peer-reviewed publications. She became the founding co-director of Stanford's University-level initiative - the Human-Centered AI Institute, along with co-director Dr. John Etchemendy, former provost of Stanford University. The institute aligns with Li's aims to advance AI research, education, policy, and practice to improve the human condition. While at Princeton in 2007, Li led the development of ImageNet, a massive visual database designed to advance object recognition in AI. The project involved labeling over 14 million images using Amazon Mechanical Turk and inspired the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), which catalyzed progress in deep learning and led to dramatic improvements in image classification performance. The database addressed a key bottleneck in computer vision: the lack of large, annotated datasets for training machine learning models. Today, ImageNet is credited as a cornerstone innovation that underpins advancements in autonomous vehicles, facial recognition, and medical imaging. On her sabbatical from Stanford University from January 2017 to fall of 2018, Li joined Google Cloud as its Chief Scientist of AI/ML and Vice President. At Google, her team focused on democratizing AI technology and lowering the barrier for entrance to businesses and developers, including the developments of products like AutoML. In September 2017, Google secured a contract from the Department of Defense called Project Maven, which aimed to use AI techniques to interpret images captured by drone cameras. Google told employees who protested the company's work on Project Maven that their role was "specifically scoped to be for non-offensive purposes". In June 2018, Google told employees it would not seek renewal of the contract. In internal emails which were later leaked to reporters, Li expressed enthusiasm for the Google Cloud role in Project Maven, but warned against mentioning its AI component, saying that military AI is linked in the public mind with the danger of autonomous weapons. Asked about those leaked emails, Li told The New York Times, "I believe in human-centered AI to benefit people in positive and benevolent ways. It is deeply against my principles to work on any project that I think is to weaponize AI." In the fall of 2018, Li left Google and returned to Stanford University to continue her professorship. In 2023, Li co-led the launch of the RAISE-Health (Responsible AI for Safe and Equitable Health) initiative at Stanford University in collaboration with Stanford medicine. The initiative aims to develop frameworks for the responsible use of artificial intelligence in healthcare, including clinical care, biomedical research, and patient safety. According to her Stanford profile, she has been on partial academic leave from January 2024 through the end of 2025 to focus on entrepreneurial ventures. In 2024, Li said there was a disparity between private-sector investment in AI and support for academic and government research, and called for greater public funding for scientific uses of the technology and for studying its risks. Li is also known for her non-profit work as the co-founder and chairperson of nonprofit organization AI4ALL, whose mission is to educate the next generation of AI technologists, thinkers and leaders by promoting diversity and inclusion through human-centered AI principles. The program was created in collaboration with Melinda French Gates and Jensen Huang. Prior to establishing AI4ALL in 2017, Li and her former student Olga Russakovsky, currently an assistant professor in Princeton University, co-founded and co-directed the precursor program at Stanford called SAILORS (Stanford AI Lab OutReach Summers). SAILORS was an annual summer camp at Stanford dedicated to 9th grade high school girls in AI education and research, established in 2015 till it changed its name to AI4ALL @Stanford in 2017. In 2018, AI4ALL has successfully launched five more summer programs in addition to Stanford, including Princeton University, Carnegie Mellon University, Boston University, University of California Berkeley, and Canada's Simon Fraser University. We are at a turning point. AI's influence continues to grow, but representation and inclusion of a diversity of researchers in the field does not. It's critical that we seize this moment to create structures that will support long-term, positive changes. This won't happen via a single mechanism or quick fix. It starts with early education and extends to the existing structures of power within academia, work cultures among current AI researchers, and gatekeeping functions of research publishing, to name a few levers of change. Li has been described as a "researcher bringing humanity to AI". Li was elected as a member of the American Academy of Arts and Sciences in 2021, the National Academy of Engineering in 2020, and the National Academy of Medicine in 2020. In a November 2023 interview with The Guardian, Li said that while she would not refer to herself as the "godmother

    Read more →
  • Retrieval-based Voice Conversion

    Retrieval-based Voice Conversion

    Retrieval-based Voice Conversion (RVC) is an open source voice conversion AI algorithm that enables realistic speech-to-speech transformations, accurately preserving the intonation and audio characteristics of the original speaker. == Overview == In contrast to text-to-speech systems such as ElevenLabs, RVC differs by providing speech-to-speech outputs instead. It maintains the modulation, timbre and vocal attributes of the original speaker, making it suitable for applications where emotional tone is crucial. The algorithm enables both pre-processed and real-time voice conversion with low latency. This real-time capability marks a significant advancement over previous AI voice conversion technologies, such as So-vits SVC. Its speed and accuracy have led many to note that its generated voices sound near-indistinguishable from "real life", provided that sufficient computational specifications and resources (e.g., a powerful GPU and ample RAM) are available when running it locally and that a high-quality voice model is used. == Technical foundation == Retrieval-based Voice Conversion (RVC) utilizes a hybrid approach that integrates feature extraction with retrieval-based synthesis. Instead of directly mapping source speaker features to the target speaker using statistical models, RVC retrieves relevant segments from a target speech database, aiming to enhance the naturalness and speaker fidelity of the converted speech. At a high level, the RVC system typically comprises three main components: (1) a content feature extractor, such as a phonetic posteriorgram (PPG) encoder or self-supervised models like HuBERT; (2) a vector retrieval module that searches a target voice database for the most similar speech units; and (3) a vocoder or neural decoder that synthesizes waveform output from the retrieved representations. The retrieval-based paradigm aims to mitigate the oversmoothing effect commonly observed in fully neural sequence-to-sequence models, potentially leading to more expressive and natural-sounding speech. Furthermore, with the incorporation of high-dimensional embeddings and k-nearest-neighbor search algorithms, the model can perform efficient matching across large-scale databases without significant computational overhead. Recent RVC frameworks have incorporated adversarial learning strategies and GAN-based vocoders, such as HiFi-GAN, to enhance synthesis quality. These integrations have been shown to produce clearer harmonics and reduce reconstruction errors. == Research developments == Research on RVC has recently explored the use of self-supervised learning (SSL) encoders such as wav2vec 2.0 and HuBERT to replace hand-engineered features like MFCCs. These encoders improve content preservation, especially when source and target speakers have dissimilar speaking styles or accents. Moreover, modern RVC models leverage vector quantization methods to discretize the acoustic space, improving both synthesis accuracy and generalization across unseen speakers. For example, retrieval-augmented VQ models can condition the synthesis stage on quantized speech tokens, which enhances controllability and style transfer. Despite its strengths, RVC still faces limitations related to database coverage, especially in real-time or few-shot settings. Inadequate diversity in the target voice corpus may lead to suboptimal retrieval or unnatural prosody. These advances demonstrate the viability of RVC as a strong alternative to conventional deep learning VC systems, balancing both flexibility and efficiency in diverse voice synthesis applications. == Training process == The training pipeline for retrieval-based voice conversion typically includes a preprocessing step where the target speaker's dataset is segmented and normalized. A pitch extractor such as librosa or DDSP-DDC may be used to obtain fundamental frequency (F0) features. During training, the model learns to map content features from the source speaker to the acoustic representation of the target speaker while maintaining pitch and prosody. The training objective often combines reconstruction loss with feature consistency loss across intermediate layers, and may incorporate cycle consistency loss to preserve speaker identity. Fine-tuning on small datasets is feasible due to the use of pre-trained models, particularly for the SSL encoder and content extractor components. This approach allows transfer learning to be applied effectively, enabling the model to converge faster and generalize better to unseen inputs. Most open implementations support batch training, gradient accumulation, and mixed-precision acceleration (e.g., FP16), especially when utilizing NVIDIA CUDA-enabled GPUs. == Real-time deployment == RVC systems can be deployed in real-time scenarios through WebUI interfaces and streaming audio frameworks. Optimizations include converting the inference graph to ONNX or TensorRT formats, reducing latency. Audio buffers are typically processed in chunks of 0.2–0.5 seconds to ensure minimal delay and seamless conversion. Cross-platform compatibility with tools such as OBS Studio and Voicemeeter enables integration into live streaming, video production, or virtual avatar environments. == Applications and concerns == The technology enables voice changing and mimicry, allowing users to create accurate models of others using only a negligible amount of minutes of clear audio samples. These voice models can be saved as .pth (PyTorch) files. While this capability facilitates numerous creative applications, it has also raised concerns about potential misuse as deepfake software for identity theft and malicious impersonation through voice calls. == Ethical and legal considerations == As with other deep generative models, the rise of RVC technology has led to increasing debate about copyright, consent, and authorship. While some jurisdictions may allow parody or fair use in creative contexts, impersonating living individuals without permission may infringe upon privacy and likeness rights. As a result, some platforms have begun issuing takedown notices against AI-generated voice content that closely mimics celebrities or musicians. === In pop culture === RVC inference has been used to create realistic depictions of song covers, such as replacing original vocals with characters like Twilight Sparkle and Mordecai to have them sing duets of popular music like "Airplanes" and "Somebody That I Used to Know." These AI-generated covers, which can sound strikingly similar to the voice imitated, have gained popularity on platforms like YouTube as humorous memes.

    Read more →
  • Contrastive Language-Image Pre-training

    Contrastive Language-Image Pre-training

    Contrastive Language-Image Pre-training (CLIP) is a technique for training a pair of neural network models, one for image understanding and one for text understanding, using a contrastive objective. This method has enabled broad applications across multiple domains, including cross-modal retrieval, text-to-image generation, and aesthetic ranking. == Algorithm == The CLIP method trains a pair of models contrastively. One model takes in a piece of text as input and outputs a single vector representing its semantic content. The other model takes in an image and similarly outputs a single vector representing its visual content. The models are trained so that the vectors corresponding to semantically similar text-image pairs are close together in the shared vector space, while those corresponding to dissimilar pairs are far apart. To train a pair of CLIP models, one would start by preparing a large dataset of image-caption pairs. During training, the models are presented with batches of N {\displaystyle N} image-caption pairs. Let the outputs from the text and image models be respectively v 1 , . . . , v N , w 1 , . . . , w N {\displaystyle v_{1},...,v_{N},w_{1},...,w_{N}} . Two vectors are considered "similar" if their dot product is large. The loss incurred on this batch is the multi-class N-pair loss, which is a symmetric cross-entropy loss over similarity scores: − 1 N ∑ i ln ⁡ e v i ⋅ w i / T ∑ j e v i ⋅ w j / T − 1 N ∑ j ln ⁡ e v j ⋅ w j / T ∑ i e v i ⋅ w j / T {\displaystyle -{\frac {1}{N}}\sum _{i}\ln {\frac {e^{v_{i}\cdot w_{i}/T}}{\sum _{j}e^{v_{i}\cdot w_{j}/T}}}-{\frac {1}{N}}\sum _{j}\ln {\frac {e^{v_{j}\cdot w_{j}/T}}{\sum _{i}e^{v_{i}\cdot w_{j}/T}}}} In essence, this loss function encourages the dot product between matching image and text vectors ( v i ⋅ w i {\displaystyle v_{i}\cdot w_{i}} ) to be high, while discouraging high dot products between non-matching pairs. The parameter T > 0 {\displaystyle T>0} is the temperature, which is parameterized in the original CLIP model as T = e − τ {\displaystyle T=e^{-\tau }} where τ ∈ R {\displaystyle \tau \in \mathbb {R} } is a learned parameter. Other loss functions are possible. For example, Sigmoid CLIP (SigLIP) proposes the following loss function: L = 1 N ∑ i , j ∈ 1 : N f ( ( 2 δ i , j − 1 ) ( e τ w i ⋅ v j + b ) ) {\displaystyle L={\frac {1}{N}}\sum _{i,j\in 1:N}f((2\delta _{i,j}-1)(e^{\tau }w_{i}\cdot v_{j}+b))} where f ( x ) = ln ⁡ ( 1 + e − x ) {\displaystyle f(x)=\ln(1+e^{-x})} is the negative log sigmoid loss, and the Dirac delta symbol δ i , j {\displaystyle \delta _{i,j}} is 1 if i = j {\displaystyle i=j} else 0. == CLIP models == While the original model was developed by OpenAI, subsequent models have been trained by other organizations as well. === Image model === The image encoding models used in CLIP are typically vision transformers (ViT). The naming convention for these models often reflects the specific ViT architecture used. For instance, "ViT-L/14" means a "vision transformer large" (compared to other models in the same series) with a patch size of 14, meaning that the image is divided into 14-by-14 pixel patches before being processed by the transformer. The size indicator ranges from B, L, H, G (base, large, huge, giant), in that order. Other than ViT, the image model is typically a convolutional neural network, such as ResNet (in the original series by OpenAI), or ConvNeXt (in the OpenCLIP model series by LAION). Since the output vectors of the image model and the text model must have exactly the same length, both the image model and the text model have fixed-length vector outputs, which in the original report is called "embedding dimension". For example, in the original OpenAI model, the ResNet models have embedding dimensions ranging from 512 to 1024, and for the ViTs, from 512 to 768. Its implementation of ViT was the same as the original one, with one modification: after position embeddings are added to the initial patch embeddings, there is a LayerNorm. Its implementation of ResNet was the same as the original one, with 3 modifications: In the start of the CNN (the "stem"), they used three stacked 3x3 convolutions instead of a single 7x7 convolution, as suggested by. There is an average pooling of stride 2 at the start of each downsampling convolutional layer (they called it rect-2 blur pooling according to the terminology of ). This has the effect of blurring images before downsampling, for antialiasing. The final convolutional layer is followed by a multiheaded attention pooling. ALIGN a model with similar capabilities, trained by researchers from Google used EfficientNet, a kind of convolutional neural network. === Text model === The text encoding models used in CLIP are typically Transformers. In the original OpenAI report, they reported using a Transformer (63M-parameter, 12-layer, 512-wide, 8 attention heads) with lower-cased byte pair encoding (BPE) with 49152 vocabulary size. Context length was capped at 76 for efficiency. Like GPT, it was decoder-only, with only causally-masked self-attention. Its architecture is the same as GPT-2. Like BERT, the text sequence is bracketed by two special tokens [SOS] and [EOS] ("start of sequence" and "end of sequence"). Take the activations of the highest layer of the transformer on the [EOS], apply LayerNorm, then a final linear map. This is the text encoding of the input sequence. The final linear map has output dimension equal to the embedding dimension of whatever image encoder it is paired with. These models all had context length 77 and vocabulary size 49408. ALIGN used BERT of various sizes. == Dataset == === WebImageText === The CLIP models released by OpenAI were trained on a dataset called "WebImageText" (WIT) containing 400 million pairs of images and their corresponding captions scraped from the internet. The total number of words in this dataset is similar in scale to the WebText dataset used for training GPT-2, which contains about 40 gigabytes of text data. The dataset contains 500,000 text-queries, with up to 20,000 (image, text) pairs per query. The text-queries were generated by starting with all words occurring at least 100 times in English Wikipedia, then extended by bigrams with high mutual information, names of all Wikipedia articles above a certain search volume, and WordNet synsets. The dataset is private and has not been released to the public, and there is no further information on it. ==== Data preprocessing ==== For the CLIP image models, the input images are preprocessed by first dividing each of the R, G, B values of an image by the maximum possible value, so that these values fall between 0 and 1, then subtracting by [0.48145466, 0.4578275, 0.40821073], and dividing by [0.26862954, 0.26130258, 0.27577711]. The rationale was that these are the mean and standard deviations of the images in the WebImageText dataset, so this preprocessing step roughly whitens the image tensor. These numbers slightly differ from the standard preprocessing for ImageNet, which uses [0.485, 0.456, 0.406] and [0.229, 0.224, 0.225]. If the input image does not have the same resolution as the native resolution (224×224 for all except ViT-L/14@336px, which has 336×336 resolution), then the input image is first scaled by bicubic interpolation, so that its shorter side is the same as the native resolution, then the central square of the image is cropped out. === Others === ALIGN used over one billion image-text pairs, obtained by extracting images and their alt-tags from online crawling. The method was described as similar to how the Conceptual Captions dataset was constructed, but instead of complex filtering, they only applied a frequency-based filtering. Later models trained by other organizations had published datasets. For example, LAION trained OpenCLIP with published datasets LAION-400M, LAION-2B, and DataComp-1B. == Training == In the original OpenAI CLIP report, they reported training 5 ResNet and 3 ViT (ViT-B/32, ViT-B/16, ViT-L/14). Each was trained for 32 epochs. The largest ResNet model took 18 days to train on 592 V100 GPUs. The largest ViT model took 12 days on 256 V100 GPUs. All ViT models were trained on 224×224 image resolution. The ViT-L/14 was then boosted to 336×336 resolution by FixRes, resulting in a model. They found this was the best-performing model. In the OpenCLIP series, the ViT-L/14 model was trained on 384 A100 GPUs on the LAION-2B dataset, for 160 epochs for a total of 32B samples seen. == Applications == === Cross-modal retrieval === CLIP's cross-modal retrieval enables the alignment of visual and textual data in a shared latent space, allowing users to retrieve images based on text descriptions and vice versa, without the need for explicit image annotations. In text-to-image retrieval, users input descriptive text, and CLIP retrieves images with matching embeddings. In image-to-text retrieval, images are used to find related text content. CLIP’s ability to connect vis

    Read more →
  • MIT Computer Science and Artificial Intelligence Laboratory

    MIT Computer Science and Artificial Intelligence Laboratory

    Computer Science and Artificial Intelligence Laboratory (CSAIL) is a research institute at the Massachusetts Institute of Technology (MIT) formed by the 2003 merger of the Laboratory for Computer Science (LCS) and the Artificial Intelligence Laboratory (AI Lab). Housed within the Ray and Maria Stata Center, CSAIL is the largest on-campus laboratory as measured by research scope and membership. It is part of the Schwarzman College of Computing but is also overseen by the MIT Vice President of Research. == Research activities == CSAIL's research activities are organized around a number of semi-autonomous research groups, each of which is headed by one or more professors or research scientists. These groups are divided up into seven general areas of research: Artificial intelligence Computational biology Graphics and vision Language and learning Theory of computation Robotics Systems (includes computer architecture, databases, distributed systems, networks and networked systems, operating systems, programming methodology, and software engineering, among others) == History == Computing Research at MIT began with Vannevar Bush's research into a differential analyzer and Claude Shannon's electronic Boolean algebra in the 1930s, the wartime MIT Radiation Laboratory, the post-war Project Whirlwind and the Research Laboratory of Electronics (RLE), and MIT Lincoln Laboratory's SAGE in the early 1950s. At MIT, research in the field of artificial intelligence began in the late 1950s. === Project MAC === On July 1, 1963, Project MAC (the Project on Mathematics and Computation, later backronymed to Multiple Access Computer, Machine Aided Cognitions, or Man and Computer) was launched with a $2 million grant from the Defense Advanced Research Projects Agency (DARPA). Project MAC's original director was Robert Fano of MIT's Research Laboratory of Electronics (RLE). Fano decided to call MAC a "project" rather than a "laboratory" for reasons of internal MIT politics – if MAC had been called a laboratory, then it would have been more difficult to raid other MIT departments for research staff. The program manager responsible for the DARPA grant was J. C. R. Licklider, who had previously been at MIT conducting research in RLE, and would later succeed Fano as director of Project MAC. Project MAC would become famous for groundbreaking research in operating systems, artificial intelligence, and the theory of computation. Its contemporaries included Project Genie at Berkeley, the Stanford Artificial Intelligence Laboratory, and (somewhat later) University of Southern California's (USC's) Information Sciences Institute. An "AI Group" including Marvin Minsky (the director), John McCarthy (inventor of Lisp), and a talented community of computer programmers were incorporated into Project MAC. They were interested principally in the problems of vision, mechanical motion and manipulation, and language, which they view as the keys to more intelligent machines. In the 1960s and 1970s the AI Group developed a time-sharing operating system called Incompatible Timesharing System (ITS) which ran on PDP-6 and later PDP-10 computers. The early Project MAC community included Fano, Minsky, Licklider, Fernando J. Corbató, and a community of computer programmers and enthusiasts among others who drew their inspiration from former colleague John McCarthy. These founders envisioned the creation of a computer utility whose computational power would be as reliable as an electric utility. To this end, Corbató brought the first computer time-sharing system, Compatible Time-Sharing System (CTSS), with him from the MIT Computation Center, using the DARPA funding to purchase an IBM 7094 for research use. One of the early focuses of Project MAC would be the development of a successor to CTSS, Multics, which was to be the first high availability computer system, developed as a part of an industry consortium including General Electric and Bell Laboratories. In 1966, Scientific American featured Project MAC in the September thematic issue devoted to computer science, that was later published in book form. At the time, the system was described as having approximately 100 TTY terminals, mostly on campus but with a few in private homes. Only 30 users could be logged in at the same time. The project enlisted students in various classes to use the terminals simultaneously in problem solving, simulations, and multi-terminal communications as tests for the multi-access computing software being developed. === AI Lab and LCS === In the late 1960s, Minsky's artificial intelligence group was seeking more space, and was unable to get satisfaction from project director Licklider. Minsky found that although Project MAC as a single entity could not get the additional space he wanted, he could split off to form his own laboratory and then be entitled to more office space. As a result, the MIT AI Lab was formed in 1970, and many of Minsky's AI colleagues left Project MAC to join him in the new laboratory, while most of the remaining members went on to form the Laboratory for Computer Science. Talented programmers such as Richard Stallman, who used TECO to develop EMACS, flourished in the AI Lab during this time. Those researchers who did not join the smaller AI Lab formed the Laboratory for Computer Science and continued their research into operating systems, programming languages, distributed systems, and the theory of computation. Two professors, Hal Abelson and Gerald Jay Sussman, chose to remain neutral—their group was referred to variously as Switzerland and Project MAC for the next 30 years. Among much else, the AI Lab led to the invention of Lisp machines and their attempted commercialization by two companies in the 1980s: Symbolics and Lisp Machines Inc. === CSAIL === On the fortieth anniversary of Project MAC's establishment, July 1, 2003, LCS was merged with the AI Lab to form the MIT Computer Science and Artificial Intelligence Laboratory, or CSAIL. This merger created the largest laboratory (over 600 personnel) on the MIT campus. In 2018, CSAIL launched a five-year collaboration program with IFlytek, a company sanctioned the following year for allegedly using its technology for surveillance and human rights abuses in Xinjiang. In October 2019, MIT announced that it would review its partnerships with sanctioned firms such as iFlyTek and SenseTime. In April 2020, the agreement with iFlyTek was terminated. CSAIL moved from the School of Engineering to the newly formed Schwarzman College of Computing by February 2020. == Offices == From 1963 to 2004, Project MAC, LCS, the AI Lab, and CSAIL had their offices at 545 Technology Square, taking over more and more floors of the building over the years. In 2004, CSAIL moved to the new Ray and Maria Stata Center, which was built specifically to house it and other departments. == Outreach activities == The IMARA (from Swahili word for "power") group sponsors a variety of outreach programs that bridge the global digital divide. Its aim is to find and implement long-term, sustainable solutions which will increase the availability of educational technology and resources to domestic and international communities. These projects are run under the aegis of CSAIL and staffed by MIT volunteers who give training, install and donate computer setups in greater Boston, Massachusetts, Kenya, Native American Indian tribal reservations in the American Southwest such as the Navajo Nation, the Middle East, and Fiji Islands. The CommuniTech project strives to empower under-served communities through sustainable technology and education and does this through the MIT Used Computer Factory (UCF), providing refurbished computers to under-served families, and through the Families Accessing Computer Technology (FACT) classes, it trains those families to become familiar and comfortable with computer technology. == Notable researchers == (Including members and alumni of CSAIL's predecessor laboratories) MacArthur Fellows Tim Berners-Lee, Erik Demaine, Dina Katabi, Daniela L. Rus, Regina Barzilay, Peter Shor, Richard Stallman, and Joshua Tenenbaum Turing Award recipients Leonard M. Adleman, Fernando J. Corbató, Shafi Goldwasser, Butler W. Lampson, John McCarthy, Silvio Micali, Marvin Minsky, Ronald L. Rivest, Adi Shamir, Barbara Liskov, and Michael Stonebraker IJCAI Computers and Thought Award recipients Terry Winograd, Patrick Winston, David Marr, Gerald Jay Sussman, Rodney Brooks Rolf Nevanlinna Prize recipients Madhu Sudan, Peter Shor, Constantinos Daskalakis Gödel Prize recipients Shafi Goldwasser (two-time recipient), Silvio Micali, Maurice Herlihy, Charles Rackoff, Johan Håstad, Peter Shor, and Madhu Sudan Grace Murray Hopper Award recipients Robert Metcalfe, Shafi Goldwasser, Guy L. Steele, Jr., Richard Stallman, and W. Daniel Hillis Textbook authors Harold Abelson and Gerald Jay Sussman, Richard Stallman, Thomas H. Cormen, Charles E. Leiserson, Patrick Winston, Ronald L.

    Read more →
  • ICAD (software)

    ICAD (software)

    ICAD (Corporate history: ICAD, Inc., Concentra (name change at IPO in 1995), KTI (name change in 1998), Dassault Systèmes (purchase in 2001) () is a knowledge-based engineering (KBE) system that enables users to encode design knowledge using a semantic representation that can be evaluated for Parasolid output. ICAD has an open architecture that can utilize all the power and flexibility of the underlying language. KBE, as implemented via ICAD, received a lot of attention due to the remarkable results that appeared to take little effort. ICAD allowed one example of end-user computing that in a sense is unparalleled. Most ICAD developers were degreed engineers. Systems developed by ICAD users were non-trivial and consisted of highly complicated code. In the sense of end-user computing, ICAD was the first to allow the power of a domain tool to be in the hands of the user, at the same time being open to allow extensions as identified and defined by the domain expert or subject-matter expert (SME). A COE article looked at the resulting explosion of expectations (see AI winter), which were not sustainable. However, such a bubble burst does not diminish the existence of ability that would exist were expectations and use reasonable or properly managed. == History == The original implementation of ICAD was on a Lisp machine (Symbolics). Some of the principals involved with the development were Larry Rosenfeld, Avrum Belzer, Patrick M. O'Keefe, Philip Greenspun, and David F. Place. The time frame was 1984–85. ICAD started on special-purpose Symbolics Lisp hardware and was then ported to Unix when Common Lisp became portable to general-purpose workstations. The original domain for ICAD was mechanical design with many application successes. However, ICAD has found use in other domains, such as electrical design, shape modeling, etc. An example project could be wind tunnel design or the development of a support tool for aircraft multidisciplinary design. Further examples can be found in the presentations at the annual IIUG (International ICAD Users Group) that have been published in the KTI Vault (1999 through 2002). Boeing and Airbus used ICAD extensively to develop various components in the 1990s and early 21st century. As of 2003, ICAD was featured strongly in several areas as evidenced by the Vision & Strategy Product Vision and Strategy presentation. After 2003, ICAD use diminished. At the end of 2001, the KTI Company faced financial difficulties and laid off most of its best staff. They were eventually bought out by Dassault who effectively scuppered the ICAD product. See IIUG at COE, 2003 (first meeting due to Dassault by KTI) The ICAD system was very expensive, relatively, and was in the price range of high-end systems. Market dynamics couldn't support this as there may not have been sufficient differentiating factors between ICAD and the lower-end systems (or the promises from Dassault). KTI was absorbed by Dassault Systèmes and ICAD is no longer considered the go-forward tool for knowledge-based engineering (KBE) applications by that company. Dassault Systèmes is promoting a suite of tools oriented around version 5 of their popular CATIA CAD application, with Knowledgeware the replacement for ICAD. As of 2005, things were still a bit unclear. ICAD 8.3 was delivered. The recent COE Aerospace Conference had a discussion about the futures of KBE. One issue involves the stacking of 'meta' issues within a computer model. How this is resolved, whether by more icons or the availability of an external language, remains to be seen. The Genworks GDL product (including kernel technology from the Gendl Project) is the nearest functional equivalent to ICAD currently available. == Particulars == ICAD provided a declarative language (IDL) using New Flavors (never converted to Common Lisp Object System (CLOS)) that supported a mechanism for relating parts (defpart) via a hierarchical set of relationships. Technically, the ICAD Defpart was a Lisp macro; the ICAD defpart list was a set of generic classes that can be instantiated with specific properties depending upon what was represented. This defpart list was extendible via composited parts that represented domain entities. Along with the part-subpart relations, ICAD supported generic relations via the object modeling abilities of Lisp. Example applications of ICAD range from a small collection of defparts that represents a part or component to a larger collection that represents an assembly. In terms of power, an ICAD system, when fully specified, can generate thousands of instances of parts on a major assembly design. One example of an application driving thousands of instances of parts is that of an aircraft wing – where fastener type and placement may number in the thousands, each instance requiring evaluation of several factors driving the design parameters. == Futures (KBE, etc.) == One role for ICAD may be serving as the defining prototype for KBE which would require that we know more about what occurred the past 15 years (much information is tied up behind corporate firewalls and under proprietary walls). With the rise of functional programming languages (an example is Haskell) in the markets, perhaps some of the power attributable to Lisp may be replicated.

    Read more →