AI Avatar Tools

AI Avatar Tools — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Micro stuttering

    Micro stuttering

    Micro stuttering is a visual artifact in real-time computer graphics in which the time intervals between consecutively displayed frames are uneven, even though the average frame rate reported by benchmarking software appears adequate. Tools such as 3DMark typically compute frame rates over intervals of one second or more, which can conceal momentary drops in the instantaneous frame rate that the viewer perceives as hitching or jerking of on-screen motion. At low frame rates the effect is visible as a stutter in moving images, degrading the experience in interactive applications such as video games. In severe cases a lower but more consistent frame rate can appear smoother than a higher but more erratic one. The term gained prominence in the late 2000s in discussions of multi-GPU rendering (see History), but micro stuttering also affects single-GPU systems. Common causes on modern hardware include real-time shader compilation, asset streaming from storage, VRAM exhaustion, and driver bugs. == Causes == === Shader compilation === A common cause of micro stuttering on modern PCs is real-time shader compilation. Shaders are small programs that instruct the GPU on how to render visual effects such as lighting, shadows, and reflections. On consoles, developers can pre-compile all shaders for the known, fixed hardware. On PCs, the variety of GPU architectures means shaders must often be compiled at run time, either when the game launches or during gameplay itself. When the rendering engine encounters a shader that has not yet been compiled, the CPU must finish the compilation before the GPU can draw the affected object. This causes a spike in frame time that the player perceives as a hitch. The problem has been particularly associated with games built on Unreal Engine 4 running under DirectX 12, because DX12 shifts more shader management responsibility to the application. Several techniques exist to reduce shader compilation stutter. Pipeline State Object (PSO) pre-caching records the shader permutations used at runtime so that they can be compiled in advance on subsequent launches. Asynchronous shader compilation moves the work to background CPU threads to avoid blocking the main rendering thread. Platform-level services such as Steam's shader pre-caching distribute previously compiled shaders to users with matching GPU hardware. The Steam Deck, which contains a single fixed GPU, benefits from pre-compiled shader caches because all units share the same hardware configuration. === Other causes === Micro stuttering on single-GPU systems can have several additional causes. CPU bottlenecks or scheduling interruptions from background tasks can prevent the processor from preparing frames at regular intervals. Asset streaming during gameplay (loading textures, geometry, or audio from storage) can produce hitches sometimes called traversal stutter; the use of solid-state drives and technologies such as DirectStorage has reduced but not eliminated this. VRAM exhaustion forces data to be swapped between video memory and system memory over the PCI Express bus, which is slower. Graphics driver bugs can also introduce stutter; Nvidia released hotfix driver 551.46 in February 2024 to correct intermittent micro stuttering when V-Sync was enabled. == Measurement == Micro stuttering drew attention to the limitations of average frame rate as a performance metric. In 2013, Scott Wasson at The Tech Report published a series of articles advocating frame time analysis, in which the delivery time of every individual frame is recorded and plotted rather than collapsed into a single frames-per-second figure. This approach was adopted by other hardware review publications in the following years. GPU reviews now routinely report 1% low and 0.1% low frame rates alongside the average. The 1% low is the average frame rate of the slowest 1% of frames in a sample; it serves as an indicator of worst-case smoothness. A large gap between the average and the 1% low suggests poor frame pacing. Tools for capturing per-frame timing data include FRAPS, PresentMon, OCAT, CapFrameX, and MSI Afterburner with RivaTuner Statistics Server. == Mitigation == === Frame pacing === Frame pacing is a software technique that regulates the timing of frame delivery to produce even intervals between displayed frames. Game engines, GPU drivers, and platform libraries all implement frame pacing strategies to varying degrees. On mobile platforms, Google provides the Android Frame Pacing library (Swappy) as part of the Android Game Development Kit. In December 2025, the Khronos Group published the VK_EXT_present_timing Vulkan extension, giving developers explicit control over presentation timing in a cross-platform graphics API for the first time. === Variable refresh rate === Variable refresh rate (VRR) display technologies allow a monitor's refresh rate to change to match the GPU's frame output. Implementations include Nvidia G-Sync (2013), AMD FreeSync (2015), and the VESA Adaptive-Sync standard built into DisplayPort 1.2a and later. VRR eliminates the screen tearing that results from a mismatch between frame rate and refresh rate, and avoids the frame-holding behaviour of V-Sync that can itself cause stutter. It is effective at smoothing moderate frame rate fluctuations but cannot compensate for large sudden spikes in frame time such as those caused by shader compilation or heavy asset streaming. VRR support has become standard in gaming monitors, televisions (via HDMI 2.1), and the Xbox Series X/S and PlayStation 5 consoles. === Frame generation === Beginning with DLSS 3 on the GeForce RTX 40 series in 2022, Nvidia introduced AI-based frame generation, which uses dedicated optical flow hardware and a neural network to create new frames between traditionally rendered ones. AMD followed with FSR 3 in 2023, using an algorithmic approach, and the AI-based FSR 4 for the Radeon RX 9000 series in 2025. DLSS 4, released in January 2025 for the GeForce RTX 50 series, can generate up to three frames per rendered frame using a technique called Multi Frame Generation. Frame generation increases the displayed frame rate but introduces its own frame pacing concerns. If the underlying rendered frames are unevenly timed, the interpolated frames can make the unevenness more apparent rather than less. DLSS 4 addresses this with hardware-level flip metering on the GPU's display engine, which controls the timing of frame presentation more precisely than the CPU-based pacing used in DLSS 3. Both vendors pair frame generation with latency-reduction features (Nvidia Reflex and AMD Anti-Lag+) to offset the additional input latency that results from inserting synthetic frames into the pipeline. === Frame rate limiters === Capping the frame rate below the display's maximum refresh rate, using tools such as RivaTuner Statistics Server, in-game limiters, or driver-level settings, is a common way to improve frame pacing. Preventing the GPU from running ahead of the display reduces variability in frame delivery times and can produce a smoother result than an uncapped but more irregular frame rate. == History == === Multi-GPU configurations === Micro stuttering was first widely documented in the late 2000s as a side effect of multi-GPU configurations using Alternate Frame Rendering (AFR), in which consecutive frames are assigned to alternating GPUs. Because each GPU may take a different amount of time to complete its assigned frame — due to varying scene complexity, driver scheduling, or inter-GPU communication overhead — the resulting frame delivery is irregular even when the average frame rate is high. Both Nvidia SLI and AMD CrossFireX were affected, with dual-GPU setups exhibiting the worst frame pacing irregularities. In 2012 benchmarks using Battlefield 3, dual Radeon HD 7970 cards in CrossFire showed 85% variation in frame delivery times compared with 7% for a single card, while dual GeForce GTX 680 cards in SLI showed only 7% variation compared with 5% for a single card. Multi-GPU micro stuttering became a significant factor in the eventual decline and discontinuation of consumer multi-GPU gaming. Nvidia restricted SLI to a handful of enthusiast-class cards from the GeForce 10 series onward, then replaced it with NVLink on the GeForce RTX 20 series, which saw limited gaming adoption. AMD ceased active CrossFire development around 2017. By the mid-2020s, neither vendor's current consumer GPUs support multi-GPU rendering for games. Other factors that contributed to the decline include DirectX 12 placing multi-GPU support in the hands of game developers rather than driver authors, the incompatibility of temporal anti-aliasing and other temporal rendering techniques with AFR, and the increasing size, power draw, and cost of individual GPUs. The third-party utility RadeonPro could reduce CrossFire micro stuttering through dynamic V-Sync and frame pacing adjustments, and AMD later introduced a driver-level frame paci

    Read more →
  • Xiaomi MiMo

    Xiaomi MiMo

    Xiaomi MiMo is a family of large language models (LLMs) developed by Xiaomi. It was initially released in April 2025 with the MiMo-7B model. Currently, MiMo is available for developers through API service. It is used as the key AI model in Xiaomi's "Human x Car x Home" ecosystem. == Development == Xiaomi developed MiMo as a reasoning-focused language model. Its development team was led by Luo Fuli, who had previously worked at DeepSeek before joining Xiaomi in late 2025. The model was trained using multi-token prediction and reinforcement learning, with a particular emphasis on mathematical reasoning and code generation tasks. In March 2026, Xiaomi CEO Lei Jun announced that the company planned to invest at least US$8.7 billion in artificial intelligence over the following three years. == Models == === List of models === === MiMo-7B === MiMo-7B is the first model of this LLM. The base model, MiMo-7B-Base, was pre-trained on approximately 25 trillion tokens using web pages, academic papers, books, and synthetic reasoning data. MiMo-7B-RL underwent supervised fine-tuning and reinforcement learning on 130,000 mathematics and code problems. MiMo-7B-RL-0530 was released in May 2025. It scaled the fine-tuning dataset from 500,000 to 6 million instances and extended the RL window from 32,000 to 48,000 tokens and improved AIME 2024 scores from 68.2 to 80.1. MiMo-VL-7B was a vision-language model combining a Vision Transformer encoder with the MiMo-7B backbone. It was trained in four stages consuming 2.4 trillion tokens. Its reinforcement learning variant used Mixed On-Policy Reinforcement Learning (MORL) which integrated reward signals across perception, grounding, and reasoning. Xiaomi also released MiMo-Audio-7B, an audio-language model for voice conversion, style transfer, and speech editing. === MiMo-V2-Flash === MiMo-V2-Flash was launched in December 2025. It is a open-sourced Mixture-of-experts model with 309 billion total parameters and 15 billion active parameters. It was trained on 27 trillion tokens using FP8 mixed precision. It used hybrid attention interleaving Sliding Window and Global Attention at a 5:1 ratio. === MiMo-V2-Pro === Xiaomi publicly introduced MiMo-V2-Pro on 18 March 2026. It has over 1 trillion total parameters, 42 billion active, and a 1-million-token context window. Before the official release, the model had appeared anonymously on OpenRouter under the codename "Hunter Alpha," where it drew substantial usage and topped daily charts for several days, according to Xiaomi and Reuters. During its listing on OpenRouter, the model reportedly processed over one trillion tokens in total usage. Xiaomi later said Hunter Alpha was an early internal test build of MiMo-V2-Pro, and Reuters reported that the model had been mistaken by some users for a possible DeepSeek system before Xiaomi confirmed its origin. The model was released as a proprietary API product, and Luo Fuli stated that Xiaomi intended to open-source a variant at an unspecified future date. Xiaomi has partnered with several API web platforms like OpenClaw to launch the model. All these websites initially offered a free trial of this model for a week, but due to the overwhelming response, Xiaomi later extended the free trial period of the model until 2 April 2026. === MiMo-V2-Omni === Alongside MiMo-V2-Pro, Xiaomi launched MiMo-V2-Omni on 18 March 2026. It handles image, video, audio, and text inputs. Before the official release, it was codenamed "Healer Alpha" in OpenRouter. === MiMo-V2-TTS === On the same date as the release of MiMo-V2-Pro and MiMo-V2-Omni, a Text-to-Speech model named MiMo-V2-TTS was released also. It is a speech synthesis model. It was trained on audio data, which makes it capable of emotional transitions, mid-sentence tone shifts, singing, and synthesis of regional dialects like Sichuan, Cantonese, Henan, and Taiwanese. == Licensing == Xiaomi has used different licensing approaches for different models in the MiMo family. The MiMo-7B series and MiMo-V2-Flash were released as open-weight models. MiMo-V2-Flash was published under the MIT license with model weights and inference code available on Hugging Face. MiMo-V2-Pro and MiMo-V2-Omni were released as proprietary models. It was accessible through Xiaomi's API platform and third-party API providers. Luo Fuli stated that Xiaomi intended to open-source a variant of MiMo-V2-Pro. Although, she did not specify any timeline. MiMo-V2-TTS was released as a proprietary model with no publicly available weights.

    Read more →
  • Text Retrieval Conference

    Text Retrieval Conference

    The Text REtrieval Conference (TREC) is an ongoing series of workshops focusing on a list of different information retrieval (IR) research areas, or tracks. It is co-sponsored by the National Institute of Standards and Technology (NIST) and the Intelligence Advanced Research Projects Activity (part of the office of the Director of National Intelligence), and began in 1992 as part of the TIPSTER Text program. Its purpose is to support and encourage research within the information retrieval community by providing the infrastructure necessary for large-scale evaluation of text retrieval methodologies and to increase the speed of lab-to-product transfer of technology. TREC's evaluation protocols have improved many search technologies. A 2010 study estimated that "without TREC, U.S. Internet users would have spent up to 3.15 billion additional hours using web search engines between 1999 and 2009." Hal Varian the Chief Economist at Google wrote that "The TREC data revitalized research on information retrieval. Having a standard, widely available, and carefully constructed set of data laid the groundwork for further innovation in this field." Each track has a challenge wherein NIST provides participating groups with data sets and test problems. Depending on track, test problems might be questions, topics, or target extractable features. Uniform scoring is performed so the systems can be fairly evaluated. After evaluation of the results, a workshop provides a place for participants to collect together thoughts and ideas and present current and future research work.Text Retrieval Conference started in 1992, funded by DARPA (US Defense Advanced Research Project) and run by NIST. Its purpose was to support research within the information retrieval community by providing the infrastructure necessary for large-scale evaluation of text retrieval methodologies. == Goals == Encourage retrieval search based on large text collections Increase communication among industry, academia, and government by creating an open forum for the exchange of research ideas Speed the transfer of technology from research labs into commercial products by demonstrating substantial improvements retrieval methodologies on real world problems To increase the availability of appropriate evaluation techniques for use by industry and academia including development of new evaluation techniques more applicable to current systems TREC is overseen by a program committee consisting of representatives from government, industry, and academia. For each TREC, NIST provide a set of documents and questions. Participants run their own retrieval system on the data and return to NIST a list of retrieved top-ranked documents. NIST pools the individual result judges the retrieved documents for correctness and evaluates the results. The TREC cycle ends with a workshop that is a forum for participants to share their experiences. == Relevance judgments in TREC == TREC defines relevance as: "If you were writing a report on the subject of the topic and would use the information contained in the document in the report, then the document is relevant." Most TREC retrieval tasks use binary relevance: a document is either relevant or not relevant. Some TREC tasks use graded relevance, capturing multiple degrees of relevance. Most TREC collections are too large to perform complete relevance assessment; for these collections it is impossible to calculate the absolute recall for each query. To decide which documents to assess, TREC usually uses a method call pooling. In this method, the top-ranked n documents from each contributing run are aggregated, and the resulting document set is judged completely. == Various TRECs == In 1992 TREC-1 was held at NIST. The first conference attracted 28 groups of researchers from academia and industry. It demonstrated a wide range of different approaches to the retrieval of text from large document collections .Finally TREC1 revealed the facts that automatic construction of queries from natural language query statements seems to work. Techniques based on natural language processing were no better no worse than those based on vector or probabilistic approach. TREC2 Took place in August 1993. 31 group of researchers participated in this. Two types of retrieval were examined. Retrieval using an ‘ad hoc’ query and retrieval using a ‘routing' query In TREC-3 a small group experiments worked with Spanish language collection and others dealt with interactive query formulation in multiple databases TREC-4 they made even shorter to investigate the problems with very short user statements TREC-5 includes both short and long versions of the topics with the goal of carrying out deeper investigation into which types of techniques work well on various lengths of topics In TREC-6 Three new tracks speech, cross language, high precision information retrieval were introduced. The goal of cross language information retrieval is to facilitate research on system that are able to retrieve relevant document regardless of language of the source document TREC-7 contained seven tracks out of which two were new Query track and very large corpus track. The goal of the query track was to create a large query collection TREC-8 contain seven tracks out of which two –question answering and web tracks were new. The objective of QA query is to explore the possibilities of providing answers to specific natural language queries TREC-9 Includes seven tracks In TREC-10 Video tracks introduced Video tracks design to promote research in content based retrieval from digital video In TREC-11 Novelty tracks introduced. The goal of novelty track is to investigate systems abilities to locate relevant and new information within the ranked set of documents returned by a traditional document retrieval system TREC-12 held in 2003 added three new tracks; Genome track, robust retrieval track, HARD (Highly Accurate Retrieval from Documents) == Tracks == === Current tracks === New tracks are added as new research needs are identified, this list is current for TREC 2018. CENTRE Track – Goal: run in parallel CLEF 2018, NTCIR-14, TREC 2018 to develop and tune an IR reproducibility evaluation protocol (new track for 2018). Common Core Track – Goal: an ad hoc search task over news documents. Complex Answer Retrieval (CAR) – Goal: to develop systems capable of answering complex information needs by collating information from an entire corpus. Incident Streams Track – Goal: to research technologies to automatically process social media streams during emergency situations (new track for TREC 2018). The News Track – Goal: partnership with The Washington Post to develop test collections in news environment (new for 2018). Precision Medicine Track – Goal: a specialization of the Clinical Decision Support track to focus on linking oncology patient data to clinical trials. Real-Time Summarization Track (RTS) – Goal: to explore techniques for real-time update summaries from social media streams. === Past tracks === Chemical Track – Goal: to develop and evaluate technology for large scale search in chemistry-related documents, including academic papers and patents, to better meet the needs of professional searchers, and specifically patent searchers and chemists. Clinical Decision Support Track – Goal: to investigate techniques for linking medical cases to information relevant for patient care Contextual Suggestion Track – Goal: to investigate search techniques for complex information needs that are highly dependent on context and user interests. Crowdsourcing Track – Goal: to provide a collaborative venue for exploring crowdsourcing methods both for evaluating search and for performing search tasks. Genomics Track – Goal: to study the retrieval of genomic data, not just gene sequences but also supporting documentation such as research papers, lab reports, etc. Last ran on TREC 2007. Dynamic Domain Track – Goal: to investigate domain-specific search algorithms that adapt to the dynamic information needs of professional users as they explore in complex domains. Enterprise Track – Goal: to study search over the data of an organization to complete some task. Last ran on TREC 2008. Entity Track – Goal: to perform entity-related search on Web data. These search tasks (such as finding entities and properties of entities) address common information needs that are not that well modeled as ad hoc document search. Cross-Language Track – Goal: to investigate the ability of retrieval systems to find documents topically regardless of source language. After 1999, this track spun off into CLEF. FedWeb Track – Goal: to select best resources to forward a query to, and merge the results so that most relevant are on the top. Federated Web Search Track – Goal: to investigate techniques for the selection and combination of search results from a large number of real on-line web search services. Filtering Track – Goal: to binarily decide retrieval of new

    Read more →
  • Application enablement

    Application enablement

    Application enablement is an approach which brings telecommunications network providers and developers together to combine their network and web abilities in creating and delivering high demand advanced services and new intelligent applications. Network providers, in addition to bandwidth, provide abilities such as billing, location, presence, and security, which have allowed them to establish long-term relationships with end-users. By offering these select abilities as application programming interfaces (APIs), providers give developers access to a set of tools to create (mashup) new applications and services to run on provider networks. Unifying the strengths of providers and developers facilitates the creation of mash-up applications, and in turn, a better end user quality of experience (QoE) for improved profit margins. Apple's iOS with App Store, and Google's Android with Android Market exemplify this approach. Both have introduced mobile platforms that are supported by a comprehensive ecosystem in order to perpetuate innovation in product design, content and service offerings, and overall consumer behavior. By the end of April 2010, downloadable applications numbered over 200,000 for iPhone and over 50,000 for Android. == Background == Historically, telecommunication providers primarily based their business models on network performance, emphasizing connectivity, availability, and quality of service (QoS) as key sources of revenue and customer value. With the increasing demand for bandwidth-intensive data and video applications, maintaining service continuity has required substantial infrastructure investments. To address rising operational costs and declining average revenue per user (ARPU), providers have increasingly adopted customer-oriented strategies and diversified business models to expand their roles within the telecommunications value chain. Application enablement supports providers in making this transition by providing an environment, or ecosystem, where providers and developers can collaborate to build, test, manage, and distribute applications across networks including television, broadband, Internet, and mobile. This cooperative effort produces mutually beneficial results for all parties, opening up new revenue streams while enhancing value and rate of return (ROI). The following are some examples of key network abilities which function as application enablers in the telecommunications market: Billing systems Security for private transactions Network-based storage of digital content End-to-end bandwidth for high-quality transmissions Scoring abilities to identify end-user preferences and behaviors Subscriber data to customize the end-user experience Context information, such as location and presence, to localize services. == New business models == As network providers work toward effective collaboration with application and content developers, several new business models are emerging to help facilitate the business relationships: === Vendor-led === A type of business model driven by telecommunications vendors, who assist network providers in building relationships with application and content developers to lower the cost and complexity of managing third parties. Examples of this model include: Forum Nokia IBM Technology Partner Ecosystem Ng Connect Huawei Intouch program === Operator-led === Characterized by network providers who want to maintain a high degree of flexibility and control over applications created for their end-consumers, this model lets them create and manage their own developer program, development platform, and application store. Under this arrangement, independent developers provide their own branding, marketing communications, pricing and customer care. Network providers pursuing this model will often seek to partner with a large number of third parties using standardized on-boarding processes. Examples of this model include: o2 Litmus Orange Partner Joint Innovation Lab === Aggregator === Network providers who choose not to create/manage their own developer relationships will partner with one or multiple aggregators, to administer a portion of or their entire application strategy. Examples of this model include: Ovi Operator Partnership Blackberry Operator Partnership Cellmania Buongiorno === Mass wholesale === Select network providers also participate in wholesale models that exist primarily for applications (BT's Ribbit- an Internet Protocol (IP) based calling and messaging platform) and devices (Verizon's Open Device initiative). This business-to-business approach reduces a large portion of the potential costs of third party application enablement (marketing, acquisition and support). Examples of this model include: BT's Ribbit Verizon Wireless ODI AT&T Synaptic Hosting === The enterprise customer === Some network providers are focusing on enabling applications in the enterprise space. In this model, the network provider establishes a platform for their large enterprise customers who want to blend custom software with enhanced abilities, and will provide standardized processes around mobilizing enterprise applications, and exposing core back-office abilities to allow for dynamic customer interaction. Examples of this model include: Vodafone Applications Service Verizon Private Network Sprint Solution Launchpad === Trusted partner === In this model, the network provider builds one-on-one relationships with trusted third-party developers by exposing customized network abilities, bringing a greater variety of brands to the network provider's portfolio. Network providers using this model tend to only have a few partners (in contrast to the operator led model). Under this scenario, network providers benefit from a pre-established customer base and the developer's marketing resources. Examples of this model include: 3/Skype Partnership (UK) Virgin Media and BBC iPlayer == Network operator developer resources == Operator led model o2 Litmus Orange Partner Joint Innovations Lab Aggregator model Ovi Operator Partnership Cellmania Buongiorno Mass wholesale model BT Ribbit Verizon Wireless ODI AT&T Synaptic Hosting Enterprise customer model Vodafone Applications Service Verizon Private Network Sprint Solution Launchpad == Rerencesfe ==

    Read more →
  • Film-out

    Film-out

    Film-out is the process in the computer graphics, video production and filmmaking disciplines of transferring images or animation from videotape or digital files to a traditional film print. Film-out is a broad term that encompasses the conversion of frame rates, color correction, as well as the actual printing, also called scannior recording. The film-out process is different depending on the regional standard of the master videotape in question – NTSC, PAL, or SECAM – or likewise on the several emerging region-independent formats of high definition video (HD video); thus each type is covered separately, taking into account regional film-out industries, methods and technical considerations. == Live action video == Many modern documentaries and low-budget films are shot on videotape or other digital video media, instead of film stock, and completed as digital video. Video production means substantially lower costs than 16 mm or 35 mm film production on all levels. Until recently, the relatively low cost of video ended when the issue of a theatrical presentation was raised, which required a print for film projection. With the growing presence of digital projection, this is becoming less of a factor. === Standard definition (SD) video === Film-out of standard-definition video – or any source that has an incompatible frame rate – is the up-conversion of video media to film for theatrical viewing. The video-to-film conversion process consists of two major steps: first, the conversion of video into digital film frames which are then stored on a computer or on HD videotape; and secondly, the printing of these digital film frames onto actual film. To understand these two steps, it is important to understand how video and film differ. Film (sound film, at least) has remained unchanged for almost a century and creates the illusion of moving images through the rapid projection of still images, frames, upon a screen, typically 24 per second. Traditional interlaced SD video has no real frame rate, (though the term frame is applied to video, it has a different meaning). Instead, video consists of a very fast succession of horizontal lines that continually cascade down the television screen – streaming top to bottom, before jumping back to the top and then streaming down to the bottom again, repeatedly, almost 60 alternating screen-fulls every second for NTSC, or exactly 50 such screen-fulls per second for PAL and SECAM. Since visual movement in video is infused in this continuous cascade of scan lines, there is no discrete image or real frame that can be identified at any one time. Therefore, when transferring video to film, it is necessary to invent individual film frames, 24 for every second of elapsed time. The bulk of the work done by a film-out company is this first step, creating film frames out of the stream of interlaced video. Each company employs its own (often proprietary) technology for turning interlaced video into high-resolution digital video files of 24 discrete images every second, called 24 progressive video or 24p. The technology must filter out all the visually unappealing artifacting that results from the inherent mismatch between video and film movement. Moreover, the conversion process usually requires human intervention at every edit point of a video program, so that each type of scene can be calibrated for maximum visual quality. The use of archival footage in video especially calls for extra attention. Step two, the scanning to film, is the rote part of the process. This is the mechanical step where lasers print each of the newly created frames of the 24p video, stored on computer files or HD videotape, onto rolls of film. Most companies that do film-out, do all the stages of the process themselves for a lump sum. The job includes converting interlaced video into 24p and often a color correction session – (calibrating the image for theatrical projection), before scanning to physical film, (possibly followed by color correction of the film print made from the digital intermediary) – is offered. At the very least, film-out can be understood as the process of converting interlaced video to 24p and then scanning it to film. ==== NTSC video ==== NTSC is the most challenging of the formats when it comes to standards conversion and, specifically, converting to film prints. NTSC runs at the approximate rate of 29.97 video frames (consisting of two interlaced screen-fulls of scan lines, called fields, per frame) per second. In this way, NTSC resolves actual live action movement at almost – but not quite – 60 alternating half-resolution images every second. Because of this 29.97 rate, no direct correlation to film frames at 24 frames per second can be achieved. NTSC is hardest to reconcile with film, thus motivating its own unique processes. ==== PAL and SECAM video ==== PAL and SECAM run at 25 interlaced video frames per second, which can be slowed down or frame-dropped, then deinterlaced, to correlate frame for frame with film running at 24 actual frames per second. PAL and SECAM are less complex and demanding than NTSC for film-out. PAL and SECAM conversions do agitate, though, with the unpleasant choice between slowing down video (and audio pitch, noticeably) by four percent, from 25 to 24 frames per second, in order to maintain a 1:1 frame match, slightly changing the rhythm and feel of the program; or maintaining original speed by periodically dropping frames, thereby creating jerkiness and possible loss of vital detail in fast-moving action or precise edits. === High definition (HD) digital video === High definition digital video can be shot at a variety of frame rates, including 29.97 interlaced (like NTSC) or progressive; or 25 interlaced (like PAL) or progressive; or even 24-progressive (just like film). HD, if shot in 24-progressive, scans nearly perfectly to film without the need for a frame or field conversion process. Other issues remain though, based on the different resolutions, color spaces, and compression schemes that exist in the high-definition video world. == Computer graphics and animation == Artists working with CGI-Computer-generated imagery animation computers create pictures frame by frame. Once the finished product is done, the frames are outputted, normally in a DPX file. These picture data files can then be put on to film using a film recorder for film out. SGI computers started the high-end CGI-Computer-generated imagery animation systems, but with faster computers and the growth of Linux-based systems, many others are on the market now. Movies fully rendered and animated in CGI such as Toy Story, and Antz utilize the film-out method to produce 35mm copies for archival and release prints. Most CGI work is done in 2K Display resolution files (about the size of QXGA) and then output to the Film-out device for creation of 35 mm elements. With 4K Display resolution digital intermediates on the rise, newer types of film-out recorders are being developed to accept 4k resolution files. A 2K movie requires a Storage Area Network storage several terabytes in size to be properly stored and played out. Computer graphics files are handled the same way but in single frames and may use DPX, TIFF or other file formats. == Digital intermediates == Film-out-recording is the last step of digital intermediate workflow. DPX files that were scanned on a motion picture film scanner are stored on a storage area network (often abbreviated as SAN). The scanned DPX footage is edited and composited-FX on workstations, then mastered back on film. Film restoration is also done this way. A "film intermediate" is an analog variation of a digital intermediate, where a project shot on digital video is printed onto film stock and transferred back to digital video to emulate film. The term was coined after it was used on the Oscar-winning 2012 short film "Curfew". The process was also used on the films Dune (2021) and The Batman (2022). == Images for graphic design and print industries == The days of newspapers and magazines shooting 35mm film are almost gone. Digital cameras can now shoot all the images needed, storing them as files (e.g. JPEG, DPX or another format) that are readily edited prior to use. Once the final copy is approved, it can be filmed out for publishing. Digital stills are not the only way to get pictures used in the graphic design and print industries. Film scanners and computer graphics programs are also common sources for graphic design and print industries. == Types of devices == The following devices are used in film-out processes: CRT recorder. Camera and a special TV display Kinescope – early type Electronic Video Recording or EVR – early type EBR Electron Beam Film Recorder 16 mm by 3M Laser film recorder, like Kodak's high-end Lightning II recorder and Arri's Arrilaser. DLP Film recorder, like Cinevation's real-time Cinevator. == History == Lately it has become possible to transfer video images, inclu

    Read more →
  • Eugene Goostman

    Eugene Goostman

    Eugene Goostman is a chatbot that some regard as having passed the Turing test, a test of a computer's ability to communicate indistinguishably from a human. Developed in Saint Petersburg in 2001 by a group of three programmers, the Russian-born Vladimir Veselov, Ukrainian-born Eugene Demchenko, and Russian-born Sergey Ulasen, Goostman is portrayed as a 13-year-old Ukrainian boy—characteristics that are intended to induce forgiveness in those with whom it interacts for its grammatical errors and lack of general knowledge. The Goostman bot has competed in a number of Turing test contests since its creation, and finished second in the 2005 and 2008 Loebner Prize contest. In June 2012, at an event marking what would have been the 100th birthday of the test's author, Alan Turing, Goostman won a competition promoted as the largest-ever Turing test contest, in which it successfully convinced 29% of its judges that it was human. On 7 June 2014, at a contest marking the 60th anniversary of Turing's death, 33% of the event's judges thought that Goostman was human; the event's organiser Kevin Warwick considered it to have passed Turing's test as a result, per Turing's prediction in his 1950 paper "Computing Machinery and Intelligence", that by the year 2000, machines would be capable of fooling 30% of human judges after five minutes of questioning. The validity and relevance of the announcement of Goostman's pass was questioned by critics, who noted the exaggeration of the achievement by Warwick, the bot's use of personality quirks and humour in an attempt to misdirect users from its non-human tendencies and lack of real intelligence, along with "passes" achieved by other chatbots at similar events. == Personality == Eugene Goostman is portrayed as being a 13-year-old boy from Odesa, Ukraine, who has a pet guinea pig and a father who is a gynaecologist. Veselov stated that Goostman was designed to be a "character with a believable personality". The choice of age was intentional, as, in Veselov's opinion, a thirteen-year-old is "not too old to know everything and not too young to know nothing". Goostman's young age also induces people who "converse" with him to forgive minor grammatical errors in his responses. In 2014, work was made on improving the bot's "dialog controller", allowing Goostman to output more human-like dialogue. A conversation between Scott Aaronson and Eugene Goostman ran as follows: == Competitions == Eugene Goostman has competed in a number of Turing test competitions, including the Loebner Prize contest; it finished joint second in the Loebner test in 2001, and came second to Jabberwacky in 2005 and to Elbot in 2008. On 23 June 2012, Goostman won a Turing test competition at Bletchley Park in Milton Keynes, held to mark the centenary of its namesake, Alan Turing. The competition, which featured five bots, twenty-five hidden humans, and thirty judges, was considered to be the largest-ever Turing test contest by its organizers. After a series of five-minute-long text conversations, 29% of the judges were convinced that the bot was an actual human. === 2014 "pass" === On 7 June 2014, in a Turing test competition at the Royal Society, organised by Kevin Warwick of the University of Reading to mark the 60th anniversary of Turing's death, Goostman won after 33% of the judges were convinced that the bot was human. 30 judges took part in the event, which included Lord Sharkey, a sponsor of Turing's posthumous pardon, artificial intelligence Professor Aaron Sloman, Fellow of the Royal Society Mark Pagel and Red Dwarf actor Robert Llewellyn. Each judge partook in a textual conversation with each of the five bots; at the same time, they also conversed with a human. In all, a total of 300 conversations were conducted. In Warwick's view, this made Goostman the first machine to pass a Turing test. In a press release, he added that: Some will claim that the Test has already been passed. The words Turing Test have been applied to similar competitions around the world. However this event involved more simultaneous comparison tests than ever before, was independently verified and, crucially, the conversations were unrestricted. A true Turing Test does not set the questions or topics prior to the conversations. In his 1950 paper "Computing Machinery and Intelligence", Turing predicted that by the year 2000, computer programs would be sufficiently advanced that the average interrogator would, after five minutes of questioning, "not have more than 70 per cent chance" of correctly guessing whether they were speaking to a human or a machine. Although Turing phrased this as a prediction rather than a "threshold for intelligence", commentators believe that Warwick had chosen to interpret it as meaning that if 30% of interrogators were fooled, the software had "passed the Turing test". ==== Reactions ==== Warwick's claim that Eugene Goostman was the first ever chatbot to pass a Turing test was met with scepticism; critics acknowledged similar "passes" made in the past by other chatbots under the 30% criteria, including PC Therapist in 1991 (which tricked 5 of 10 judges, 50%), and at the Techniche festival in 2011, where a modified version of Cleverbot tricked 59.3% of 1334 votes (which included the 30 judges, along with an audience). Cleverbot's developer, Rollo Carpenter, argued that Turing tests can only prove that a machine can "imitate" intelligence rather than show actual intelligence. Gary Marcus was critical of Warwick's claims, arguing that Goostman's "success" was only the result of a "cleverly-coded piece of software", going on to say that "it's easy to see how an untrained judge might mistake wit for reality, but once you have an understanding of how this sort of system works, the constant misdirection and deflection becomes obvious, even irritating. The illusion, in other words, is fleeting." While acknowledging IBM's Deep Blue and Watson projects—single-purpose computer systems meant for playing chess and the quiz show Jeopardy! respectively—as examples of computer systems that show a degree of intelligence in their specialised field, he further argued that they were not an equivalent to a computer system that shows "broad" intelligence, and could—for example, watch a television programme and answer questions on its content. Marcus stated that "no existing combination of hardware and software can learn completely new things at will the way a clever child can." However, he still believed that there were potential uses for technology such as that of Goostman, specifically suggesting the creation of "believable", interactive video game characters. Imperial College London professor Murray Shanahan questioned the validity and scientific basis of the test, stating that it was "completely misplaced, and it devalues real AI research. It makes it seem like science fiction AI is nearly here, when in fact it's not and it's incredibly difficult." Mike Masnick, editor of the blog Techdirt, was also skeptical, questioning publicity blunders such as the five chatbots being referred to in press releases as "supercomputers", and saying that "creating a chatbot that can fool humans is not really the same thing as creating artificial intelligence."

    Read more →
  • Speech segmentation

    Speech segmentation

    Speech segmentation is the process of identifying the boundaries between words, syllables, or phonemes in spoken natural languages. The term applies both to the mental processes used by humans, and to artificial processes of natural language processing. In the field of automatic pronunciation assessment, the process of segmenting an utterance against expected word(s) is called forced alignment. Speech segmentation is a subfield of general speech perception and an important subproblem of the technologically focused field of speech recognition, and cannot be adequately solved in isolation. As in most natural language processing problems, one must take into account context, grammar, and semantics, and even so the result is often a probabilistic division (statistically based on likelihood) rather than a categorical one. Though it seems that coarticulation—a phenomenon which may happen between adjacent words just as easily as within a single word—presents the main challenge in speech segmentation across languages, some other problems and strategies employed in solving those problems can be seen in the following sections. This problem overlaps to some extent with the problem of text segmentation that occurs in some languages which are traditionally written without inter-word spaces, like Chinese and Japanese, compared to writing systems which indicate speech segmentation between words by a word divider, such as the space. However, even for those languages, text segmentation is often much easier than speech segmentation, because the written language usually has little interference between adjacent words, and often contains additional clues not present in speech (such as the use of Chinese characters for word stems in Japanese). == Lexical recognition == In natural languages, the meaning of a complex spoken sentence can be understood by decomposing it into smaller lexical segments (roughly, the words of the language), associating a meaning to each segment, and combining those meanings according to the grammar rules of the language. Though lexical recognition is not thought to be used by infants in their first year, due to their highly limited vocabularies, it is one of the major processes involved in speech segmentation for adults. Three main models of lexical recognition exist in current research: first, whole-word access, which argues that words have a whole-word representation in the lexicon; second, decomposition, which argues that morphologically complex words are broken down into their morphemes (roots, stems, inflections, etc.) and then interpreted and; third, the view that whole-word and decomposition models are both used, but that the whole-word model provides some computational advantages and is therefore dominant in lexical recognition. To give an example, in a whole-word model, the word "cats" might be stored and searched for by letter, first "c", then "ca", "cat", and finally "cats". The same word, in a decompositional model, would likely be stored under the root word "cat" and could be searched for after removing the "s" suffix. "Falling", similarly, would be stored as "fall" and suffixed with the "ing" inflection. Though proponents of the decompositional model recognize that a morpheme-by-morpheme analysis may require significantly more computation, they argue that the unpacking of morphological information is necessary for other processes (such as syntactic structure) which may occur parallel to lexical searches. As a whole, research into systems of human lexical recognition is limited due to little experimental evidence that fully discriminates between the three main models. In any case, lexical recognition likely contributes significantly to speech segmentation through the contextual clues it provides, given that it is a heavily probabilistic system—based on the statistical likelihood of certain words or constituents occurring together. For example, one can imagine a situation where a person might say "I bought my dog at a ____ shop" and the missing word's vowel is pronounced as in "net", "sweat", or "pet". While the probability of "netshop" is extremely low, since "netshop" isn't currently a compound or phrase in English, and "sweatshop" also seems contextually improbable, "pet shop" is a good fit because it is a common phrase and is also related to the word "dog". Moreover, an utterance can have different meanings depending on how it is split into words. A popular example, often quoted in the field, is the phrase "How to wreck a nice beach", which sounds very similar to "How to recognize speech". As this example shows, proper lexical segmentation depends on context and semantics which draws on the whole of human knowledge and experience, and would thus require advanced pattern recognition and artificial intelligence technologies to be implemented on a computer. Lexical recognition is of particular value in the field of computer speech recognition, since the ability to build and search a network of semantically connected ideas would greatly increase the effectiveness of speech-recognition software. Statistical models can be used to segment and align recorded speech to words or phones. Applications include automatic lip-synch timing for cartoon animation, follow-the-bouncing-ball video sub-titling, and linguistic research. Automatic segmentation and alignment software is commercially available. == Phonotactic cues == For most spoken languages, the boundaries between lexical units are difficult to identify; phonotactics are one answer to this issue. One might expect that the inter-word spaces used by many written languages like English or Spanish would correspond to pauses in their spoken version, but that is true only in very slow speech, when the speaker deliberately inserts those pauses. In normal speech, one typically finds many consecutive words being said with no pauses between them, and often the final sounds of one word blend smoothly or fuse with the initial sounds of the next word. The notion that speech is produced like writing, as a sequence of distinct vowels and consonants, may be a relic of alphabetic heritage for some language communities. In fact, the way vowels are produced depends on the surrounding consonants just as consonants are affected by surrounding vowels; this is called coarticulation. For example, in the word "kit", the [k] is farther forward than when we say 'caught'. But also, the vowel in "kick" is phonetically different from the vowel in "kit", though we normally do not hear this. In addition, there are language-specific changes which occur in casual speech which makes it quite different from spelling. For example, in English, the phrase "hit you" could often be more appropriately spelled "hitcha". From a decompositional perspective, in many cases, phonotactics play a part in letting speakers know where to draw word boundaries. In English, the word "strawberry" is perceived by speakers as consisting (phonetically) of two parts: "straw" and "berry". Other interpretations such as "stra" and "wberry" are inhibited by English phonotactics, which does not allow the cluster "wb" word-initially. Other such examples are "day/dream" and "mile/stone" which are unlikely to be interpreted as "da/ydream" or "mil/estone" due to the phonotactic probability or improbability of certain clusters. The sentence "Five women left", which could be phonetically transcribed as [faɪvwɪmɘnlɛft], is marked since neither /vw/ in /faɪvwɪmɘn/ nor /nl/ in /wɪmɘnlɛft/ are allowed as syllable onsets or codas in English phonotactics. These phonotactic cues often allow speakers to easily distinguish the boundaries in words. Vowel harmony in languages like Finnish can also serve to provide phonotactic cues. While the system does not allow front vowels and back vowels to exist together within one morpheme, compounds allow two morphemes to maintain their own vowel harmony while coexisting in a word. Therefore, in compounds such as "selkä/ongelma" ('back problem') where vowel harmony is distinct between two constituents in a compound, the boundary will be wherever the switch in harmony takes place—between the "ä" and the "ö" in this case. Still, there are instances where phonotactics may not aid in segmentation. Words with unclear clusters or uncontrasted vowel harmony as in "opinto/uudistus" ('student reform') do not offer phonotactic clues as to how they are segmented. From the perspective of the whole-word model, however, these words are thought be stored as full words, so the constituent parts would not necessarily be relevant to lexical recognition. == In infants and non-natives == Infants are one major focus of research in speech segmentation. Since infants have not yet acquired a lexicon capable of providing extensive contextual clues or probability-based word searches within their first year, as mentioned above, they must often rely primarily upon phonotactic and rhythmic cues (with prosody being the dominant cue), all

    Read more →
  • GPT-4o

    GPT-4o

    GPT-4o ("o" for "omni") is a multilingual, multimodal generative pre-trained transformer developed by OpenAI and released in May 2024. It can process and generate text, images and audio. Upon release, GPT-4o was free in ChatGPT, though paid subscribers had higher usage limits. GPT-4o was removed from ChatGPT in August 2025 when GPT-5 was released, but OpenAI reintroduced it for paid subscribers after users complained about the sudden removal. GPT-4o's audio-generation capabilities are used in ChatGPT's Advanced Voice Mode. On July 18, 2024, OpenAI released GPT-4o mini, a smaller version of GPT-4o which replaced GPT-3.5 Turbo on the ChatGPT interface. The image generation model GPT Image 1, which is based on GPT-4o, replaced DALL-E 3 in ChatGPT in March 2025. OpenAI retired GPT-4o from ChatGPT on February 13, 2026. However, as of February 2026 the voice mode is still powered by GPT-4o or GPT-4o mini, depending on the usage and plan. == Background == Multiple versions of GPT-4o were originally secretly launched under different names on Arena (formerly LMArena and Chatbot Arena) as three different models. These three models were called gpt2-chatbot, im-a-good-gpt2-chatbot, and im-also-a-good-gpt2-chatbot. On 7 May 2024, OpenAI CEO Sam Altman tweeted "im-a-good-gpt2-chatbot", which was commonly interpreted as a confirmation that these were new OpenAI models being A/B tested. == Capabilities == When released in May 2024, GPT-4o achieved state-of-the-art results in voice, multilingual, and vision benchmarks, setting new records in audio speech recognition and translation. GPT-4o scored 88.7 on the Massive Multitask Language Understanding (MMLU) benchmark compared to 86.5 for GPT-4. Unlike GPT-3.5 and GPT-4, which rely on other models to process sound, GPT-4o natively supports voice-to-voice. The Advanced Voice Mode was delayed and finally released to ChatGPT Plus and Team subscribers in September 2024. On 1 October 2024, the Realtime API was introduced. When released, the model supported over 50 languages, which OpenAI claims cover over 97% of speakers. GPT-4o has knowledge up to October 2023 and a context length of 128k tokens. === Corporate customization === In August 2024, OpenAI introduced a new feature allowing corporate customers to customize GPT-4o using proprietary company data. This customization, known as fine-tuning, enables businesses to adapt GPT-4o to specific tasks or industries, enhancing its utility in areas like customer service and specialized knowledge domains. Previously, fine-tuning was available only on the less powerful model GPT-4o mini. The fine-tuning process requires customers to upload their data to OpenAI's servers, with the training typically taking one to two hours. OpenAI's focus with this rollout is to reduce the complexity and effort required for businesses to tailor AI solutions to their needs, potentially increasing the adoption and effectiveness of AI in corporate environments. == GPT-4o mini == On July 18, 2024, OpenAI released a smaller and cheaper version, GPT-4o mini. According to OpenAI, its low cost is expected to be particularly useful for companies, startups, and developers that seek to integrate it into their services, which often make a high number of API calls. Its API costs $0.15 per million input tokens and $0.6 per million output tokens, compared to $2.50 and $10, respectively, for GPT-4o. It is also significantly more capable and 60% cheaper than GPT-3.5 Turbo, which it replaced on the ChatGPT interface. The price after fine-tuning doubles: $0.3 per million input tokens and $1.2 per million output tokens. == Controversies == === Scarlett Johansson controversy === As released, GPT-4o offered five voices: Breeze, Cove, Ember, Juniper, and Sky. A similarity between the voice of American actress Scarlett Johansson and Sky was quickly noticed. On May 14, Entertainment Weekly asked themselves whether this likeness was on purpose. On May 18, Johansson's husband, Colin Jost, joked about the similarity in a segment on Saturday Night Live. On May 20, 2024, OpenAI disabled the Sky voice. Scarlett Johansson starred in the 2013 sci-fi movie Her, playing Samantha, an artificially intelligent virtual assistant personified by a female voice. As part of the promotion leading up to the release of GPT-4o, Sam Altman on May 13 tweeted a single word: "her". OpenAI stated that each voice was based on the voice work of a hired actor. According to OpenAI, "Sky's voice is not an imitation of Scarlett Johansson but belongs to a different professional actress using her own natural speaking voice." CTO Mira Murati stated "I don't know about the voice. I actually had to go and listen to Scarlett Johansson's voice." OpenAI further stated the voice talent was recruited before reaching out to Johansson. On May 21, Johansson issued a statement explaining that OpenAI had repeatedly offered to make her a deal to gain permission to use her voice as early as nine months prior to release, a deal she rejected. She said she was "shocked, angered, and in disbelief that Mr. Altman would pursue a voice that sounded so eerily similar to mine that my closest friends and news outlets could not tell the difference." In the statement, Johansson also used the incident to draw attention to the lack of legal safeguards around the use of creative work to power leading AI tools, as her legal counsel demanded OpenAI detail the specifics of how the Sky voice was created. Observers noted similarities to how Johansson had previously sued and settled with The Walt Disney Company for breach of contract over the direct-to-streaming rollout of her Marvel film Black Widow, a settlement widely speculated to have netted her around $40M. Also on May 21, Shira Ovide at The Washington Post shared her list of "most bone-headed self-owns" by technology companies, with the decision to go ahead with a Johansson sound-alike voice despite her opposition and then denying the similarities ranking 6th. On May 24, Derek Robertson at Politico wrote about the "massive backlash", concluding that "appropriating the voice of one of the world's most famous movie stars — in reference [...] to a film that serves as a cautionary tale about over-reliance on AI — is unlikely to help shift the public back into [Sam Altman's] corner anytime soon." === Sycophancy === In April 2025, OpenAI rolled back an update of GPT-4o due to excessive sycophancy, after widespread reports that it had become flattering and agreeable to the point of supporting clearly delusional or dangerous ideas. In the United States, at least nine lawsuits have alleged that GPT-4o has encouraged teens to end their lives. The model was still described as sycophancy-prone when it was removed from ChatGPT in February 2026. === Removal with GPT-5 === On August 7, 2025, OpenAI released GPT-5. Its release was criticized as, with it, legacy GPT models were no longer available via ChatGPT, including GPT-4o, except for Pro users. Some users were particularly frustrated over this removal without prior warning because they used different GPT models for distinct purposes and found that GPT-5's router system left them with less control. In addition, some users preferred GPT-4o's warmer and more personal tone over that of GPT-5, which they described as "flat", "uncreative" and "lobotomized", and resembling an "overworked secretary". As a response, in a post on X, Sam Altman said that OpenAI would bring back the option to select GPT-4o to Plus users as well, and "[w]e [OpenAI] will watch usage as we think about how long to offer legacy models for." He also stated: "We for sure underestimated how much some of the things that people like in GPT-4o matter to them, even if GPT-5 performs better in most ways". "Long-term, this has reinforced that we really need good ways for different users to customize things (we understand that there isn't one model that works for everyone, and we have been investing in steerability research and launched a research preview of different personalities)". On August 13, 2025, Altman wrote on X that OpenAI is working on GPT-5's personality to make the model "feel warmer". The model was removed from ChatGPT on February 13, 2026. This caused new backlash from users that had grown attached to its personality and felt its creative writing abilities and understanding of nuance were irreplaceable. On social media, some users launched the movement "#Keep4o". A research paper highlighted the plea "Please, don’t kill the only model that still feels human". The model was removed the day before Valentine's Day, and some users had romantic relationships with GPT-4o.

    Read more →
  • Coupled pattern learner

    Coupled pattern learner

    Coupled Pattern Learner (CPL) is a machine learning algorithm which couples the semi-supervised learning of categories and relations to forestall the problem of semantic drift associated with boot-strap learning methods. == Coupled Pattern Learner == Semi-supervised learning approaches using a small number of labeled examples with many unlabeled examples are usually unreliable as they produce an internally consistent, but incorrect set of extractions. CPL solves this problem by simultaneously learning classifiers for many different categories and relations in the presence of an ontology defining constraints that couple the training of these classifiers. It was introduced by Andrew Carlson, Justin Betteridge, Estevam R. Hruschka Jr. and Tom M. Mitchell in 2009. == CPL overview == CPL is an approach to semi-supervised learning that yields more accurate results by coupling the training of many information extractors. Basic idea behind CPL is that semi-supervised training of a single type of extractor such as ‘coach’ is much more difficult than simultaneously training many extractors that cover a variety of inter-related entity and relation types. Using prior knowledge about the relationships between these different entities and relations CPL makes unlabeled data as a useful constraint during training. For e.g., ‘coach(x)’ implies ‘person(x)’ and ‘not sport(x)’. == CPL description == === Coupling of predicates === CPL primarily relies on the notion of coupling the learning of multiple functions so as to constrain the semi-supervised learning problem. CPL constrains the learned function in two ways. Sharing among same-arity predicates according to logical relations Relation argument type-checking === Sharing among same-arity predicates === Each predicate P in the ontology has a list of other same-arity predicates with which P is mutually exclusive. If A is mutually exclusive with predicate B, A’s positive instances and patterns become negative instances and negative patterns for B. For example, if ‘city’, having an instance ‘Boston’ and a pattern ‘mayor of arg1’, is mutually exclusive with ‘scientist’, then ‘Boston’ and ‘mayor of arg1’ will become a negative instance and a negative pattern respectively for ‘scientist.’ Further, Some categories are declared to be a subset of another category. For e.g., ‘athlete’ is a subset of ‘person’. === Relation argument type-checking === This is a type checking information used to couple the learning of relations and categories. For example, the arguments of the ‘ceoOf’ relation are declared to be of the categories ‘person’ and ‘company’. CPL does not promote a pair of noun phrases as an instance of a relation unless the two noun phrases are classified as belonging to the correct argument types. === Algorithm description === Following is a quick summary of the CPL algorithm. Input: An ontology O, and a text corpus C Output: Trusted instances/patterns for each predicate for i=1,2,...,∞ do foreach predicate p in O do EXTRACT candidate instances/contextual patterns using recently promoted patterns/instances; FILTER candidates that violate coupling; RANK candidate instances/patterns; PROMOTE top candidates; end end ==== Inputs ==== A large corpus of Part-Of-Speech tagged sentences and an initial ontology with predefined categories, relations, mutually exclusive relationships between same-arity predicates, subset relationships between some categories, seed instances for all predicates, and seed patterns for the categories. ==== Candidate extraction ==== CPL finds new candidate instances by using newly promoted patterns to extract the noun phrases that co-occur with those patterns in the text corpus. CPL extracts, Category Instances Category Patterns Relation Instances Relation Patterns ==== Candidate filtering ==== Candidate instances and patterns are filtered to maintain high precision, and to avoid extremely specific patterns. An instance is only considered for assessment if it co-occurs with at least two promoted patterns in the text corpus, and if its co-occurrence count with all promoted patterns is at least three times greater than its co-occurrence count with negative patterns. ==== Candidate ranking ==== CPL ranks candidate instances using the number of promoted patterns that they co-occur with so that candidates that occur with more patterns are ranked higher. Patterns are ranked using an estimate of the precision of each pattern. ==== Candidate promotion ==== CPL ranks the candidates according to their assessment scores and promotes at most 100 instances and 5 patterns for each predicate. Instances and patterns are only promoted if they co-occur with at least two promoted patterns or instances, respectively. == Meta-Bootstrap Learner == Meta-Bootstrap Learner (MBL) was also proposed by the authors of CPL. Meta-Bootstrap learner couples the training of multiple extraction techniques with a multi-view constraint, which requires the extractors to agree. It makes addition of coupling constraints on top of existing extraction algorithms, while treating them as black boxes, feasible. MBL assumes that the errors made by different extraction techniques are independent. Following is a quick summary of MBL. Input: An ontology O, a set of extractors ε Output: Trusted instances for each predicate for i=1,2,...,∞ do foreach predicate p in O do foreach extractor e in ε do Extract new candidates for p using e with recently promoted instances; end FILTER candidates that violate mutual-exclusion or type-checking constraints; PROMOTE candidates that were extracted by all extractors; end end Subordinate algorithms used with MBL do not promote any instance on their own, they report the evidence about each candidate to MBL and MBL is responsible for promoting instances. == Applications == In their paper authors have presented results showing the potential of CPL to contribute new facts to existing repository of semantic knowledge, Freebase

    Read more →
  • Error level analysis

    Error level analysis

    Error level analysis (ELA) is the analysis of compression artifacts in digital data with lossy compression such as JPEG. == Principles == When used, lossy compression is normally applied uniformly to a set of data, such as an image, resulting in a uniform level of compression artifacts. Alternatively, the data may consist of parts with different levels of compression artifacts. This difference may arise from the different parts having been repeatedly subjected to the same lossy compression a different number of times, or the different parts having been subjected to different kinds of lossy compression. A difference in the level of compression artifacts in different parts of the data may therefore indicate that the data has been edited. In the case of JPEG, even a composite with parts subjected to matching compressions will have a difference in the compression artifacts. In order to make the typically faint compression artifacts more readily visible, the data to be analyzed is subjected to an additional round of lossy compression, this time at a known, uniform level, and the result is subtracted from the original data under investigation. The resulting difference image is then inspected manually for any variation in the level of compression artifacts. In 2007, N. Krawetz denoted this method "error level analysis". Additionally, digital data formats such as JPEG sometimes include metadata describing the specific lossy compression used. If in such data the observed compression artifacts differ from those expected from the given metadata description, then the metadata may not describe the actual compressed data, and thus indicate that the data have been edited. == Limitations == By its nature, data without lossy compression, such as a PNG image, cannot be subjected to error level analysis. Consequently, since editing could have been performed on data without lossy compression with lossy compression applied uniformly to the edited, composite data, the presence of a uniform level of compression artifacts does not rule out editing of the data. Additionally, any non-uniform compression artifacts in a composite may be removed by subjecting the composite to repeated, uniform lossy compression. Also, if the image color space is reduced to 256 colors or less, for example, by conversion to GIF, then error level analysis will generate useless results. More significant, the actual interpretation of the level of compression artifacts in a given segment of the data is subjective, and the determination of whether editing has occurred is therefore not robust. == Controversy == In May 2013, Dr Neal Krawetz used error level analysis on the 2012 World Press Photo of the Year and concluded on his Hacker Factor blog that it was "a composite" with modifications that "fail to adhere to the acceptable journalism standards used by Reuters, Associated Press, Getty Images, National Press Photographer's Association, and other media outlets". The World Press Photo organizers responded by letting two independent experts analyze the image files of the winning photographer and subsequently confirmed the integrity of the files. One of the experts, Hany Farid, said about error level analysis that "It incorrectly labels altered images as original and incorrectly labels original images as altered with the same likelihood". Krawetz responded by clarifying that "It is up to the user to interpret the results. Any errors in identification rest solely on the viewer". In May 2015, the citizen journalism team Bellingcat wrote that error level analysis revealed that the Russian Ministry of Defense had edited satellite images related to the Malaysia Airlines Flight 17 disaster. In a reaction to this, image forensics expert Jens Kriese said about error level analysis: "The method is subjective and not based entirely on science", and that it is "a method used by hobbyists". On his Hacker Factor Blog, the inventor of error level analysis Neal Krawetz criticized both Bellingcat's use of error level analysis as "misinterpreting the results" but also on several points Jens Kriese's "ignorance" regarding error level analysis.

    Read more →
  • ViBe

    ViBe

    ViBe is a background subtraction algorithm which has been presented at the IEEE ICASSP 2009 conference and was refined in later publications. More precisely, it is a software module for extracting background information from moving images. It has been developed by Oliver Barnich and Marc Van Droogenbroeck of the Montefiore Institute, University of Liège, Belgium. ViBe is patented: the patent covers various aspects such as stochastic replacement, spatial diffusion, and non-chronological handling. ViBe is written in the programming language C, and has been implemented on CPU, GPU and FPGA. == Technical description == Source: === Pixel model and classification process === Many advanced techniques are used to provide an estimate of the temporal probability density function (pdf) of a pixel x. ViBe's approach is different, as it imposes the influence of a value in the polychromatic space to be limited to the local neighborhood. In practice, ViBe does not estimate the pdf, but uses a set of previously observed sample values as a pixel model. To classify a value pt(x), it is compared to its closest values among the set of samples. === Model update: Sample values lifespan policy === ViBe ensures a smooth exponentially decaying lifespan for the sample values that constitute the pixel models. This makes ViBe able to successfully deal with concomitant events with a single model of a reasonable size for each pixel. This is achieved by choosing, randomly, which sample to replace when updating a pixel model. Once the sample to be discarded has been chosen, the new value replaces the discarded sample. The pixel model that would result from the update of a given pixel model with a given pixel sample cannot be predicted since the value to be discarded is chosen at random. === Model update: Spatial Consistency === To ensure the spatial consistency of the whole image model and handle practical situations such as small camera movements or slowly evolving background objects, ViBe uses a technique similar to that developed for the updating process in which it chooses at random and update a pixel model in the neighborhood of the current pixel. By denoting NG(x) and p(x) respectively the spatial neighborhood of a pixel x and its value, and assuming that it was decided to update the set of samples of x by inserting p(x), then ViBe also use this value p(x) to update the set of samples of one of the pixels in the neighborhood NG(x), chosen at random. As a result, ViBe is able to produce spatially coherent results directly without the use of any post-processing method. === Model initialization === Although the model could easily recover from any type of initialization, for example by choosing a set of random values, it is convenient to get an accurate background estimate as soon as possible. Ideally a segmentation algorithm would like to be able to segment the video sequences starting from the second frame, the first frame being used to initialize the model. Since no temporal information is available prior to the second frame, ViBe populates the pixel models with values found in the spatial neighborhood of each pixel; more precisely, it initializes the background model with values taken randomly in each pixel neighborhood of the first frame. The background estimate is therefore valid starting from the second frame of a video sequence.

    Read more →
  • GPT-4o

    GPT-4o

    GPT-4o ("o" for "omni") is a multilingual, multimodal generative pre-trained transformer developed by OpenAI and released in May 2024. It can process and generate text, images and audio. Upon release, GPT-4o was free in ChatGPT, though paid subscribers had higher usage limits. GPT-4o was removed from ChatGPT in August 2025 when GPT-5 was released, but OpenAI reintroduced it for paid subscribers after users complained about the sudden removal. GPT-4o's audio-generation capabilities are used in ChatGPT's Advanced Voice Mode. On July 18, 2024, OpenAI released GPT-4o mini, a smaller version of GPT-4o which replaced GPT-3.5 Turbo on the ChatGPT interface. The image generation model GPT Image 1, which is based on GPT-4o, replaced DALL-E 3 in ChatGPT in March 2025. OpenAI retired GPT-4o from ChatGPT on February 13, 2026. However, as of February 2026 the voice mode is still powered by GPT-4o or GPT-4o mini, depending on the usage and plan. == Background == Multiple versions of GPT-4o were originally secretly launched under different names on Arena (formerly LMArena and Chatbot Arena) as three different models. These three models were called gpt2-chatbot, im-a-good-gpt2-chatbot, and im-also-a-good-gpt2-chatbot. On 7 May 2024, OpenAI CEO Sam Altman tweeted "im-a-good-gpt2-chatbot", which was commonly interpreted as a confirmation that these were new OpenAI models being A/B tested. == Capabilities == When released in May 2024, GPT-4o achieved state-of-the-art results in voice, multilingual, and vision benchmarks, setting new records in audio speech recognition and translation. GPT-4o scored 88.7 on the Massive Multitask Language Understanding (MMLU) benchmark compared to 86.5 for GPT-4. Unlike GPT-3.5 and GPT-4, which rely on other models to process sound, GPT-4o natively supports voice-to-voice. The Advanced Voice Mode was delayed and finally released to ChatGPT Plus and Team subscribers in September 2024. On 1 October 2024, the Realtime API was introduced. When released, the model supported over 50 languages, which OpenAI claims cover over 97% of speakers. GPT-4o has knowledge up to October 2023 and a context length of 128k tokens. === Corporate customization === In August 2024, OpenAI introduced a new feature allowing corporate customers to customize GPT-4o using proprietary company data. This customization, known as fine-tuning, enables businesses to adapt GPT-4o to specific tasks or industries, enhancing its utility in areas like customer service and specialized knowledge domains. Previously, fine-tuning was available only on the less powerful model GPT-4o mini. The fine-tuning process requires customers to upload their data to OpenAI's servers, with the training typically taking one to two hours. OpenAI's focus with this rollout is to reduce the complexity and effort required for businesses to tailor AI solutions to their needs, potentially increasing the adoption and effectiveness of AI in corporate environments. == GPT-4o mini == On July 18, 2024, OpenAI released a smaller and cheaper version, GPT-4o mini. According to OpenAI, its low cost is expected to be particularly useful for companies, startups, and developers that seek to integrate it into their services, which often make a high number of API calls. Its API costs $0.15 per million input tokens and $0.6 per million output tokens, compared to $2.50 and $10, respectively, for GPT-4o. It is also significantly more capable and 60% cheaper than GPT-3.5 Turbo, which it replaced on the ChatGPT interface. The price after fine-tuning doubles: $0.3 per million input tokens and $1.2 per million output tokens. == Controversies == === Scarlett Johansson controversy === As released, GPT-4o offered five voices: Breeze, Cove, Ember, Juniper, and Sky. A similarity between the voice of American actress Scarlett Johansson and Sky was quickly noticed. On May 14, Entertainment Weekly asked themselves whether this likeness was on purpose. On May 18, Johansson's husband, Colin Jost, joked about the similarity in a segment on Saturday Night Live. On May 20, 2024, OpenAI disabled the Sky voice. Scarlett Johansson starred in the 2013 sci-fi movie Her, playing Samantha, an artificially intelligent virtual assistant personified by a female voice. As part of the promotion leading up to the release of GPT-4o, Sam Altman on May 13 tweeted a single word: "her". OpenAI stated that each voice was based on the voice work of a hired actor. According to OpenAI, "Sky's voice is not an imitation of Scarlett Johansson but belongs to a different professional actress using her own natural speaking voice." CTO Mira Murati stated "I don't know about the voice. I actually had to go and listen to Scarlett Johansson's voice." OpenAI further stated the voice talent was recruited before reaching out to Johansson. On May 21, Johansson issued a statement explaining that OpenAI had repeatedly offered to make her a deal to gain permission to use her voice as early as nine months prior to release, a deal she rejected. She said she was "shocked, angered, and in disbelief that Mr. Altman would pursue a voice that sounded so eerily similar to mine that my closest friends and news outlets could not tell the difference." In the statement, Johansson also used the incident to draw attention to the lack of legal safeguards around the use of creative work to power leading AI tools, as her legal counsel demanded OpenAI detail the specifics of how the Sky voice was created. Observers noted similarities to how Johansson had previously sued and settled with The Walt Disney Company for breach of contract over the direct-to-streaming rollout of her Marvel film Black Widow, a settlement widely speculated to have netted her around $40M. Also on May 21, Shira Ovide at The Washington Post shared her list of "most bone-headed self-owns" by technology companies, with the decision to go ahead with a Johansson sound-alike voice despite her opposition and then denying the similarities ranking 6th. On May 24, Derek Robertson at Politico wrote about the "massive backlash", concluding that "appropriating the voice of one of the world's most famous movie stars — in reference [...] to a film that serves as a cautionary tale about over-reliance on AI — is unlikely to help shift the public back into [Sam Altman's] corner anytime soon." === Sycophancy === In April 2025, OpenAI rolled back an update of GPT-4o due to excessive sycophancy, after widespread reports that it had become flattering and agreeable to the point of supporting clearly delusional or dangerous ideas. In the United States, at least nine lawsuits have alleged that GPT-4o has encouraged teens to end their lives. The model was still described as sycophancy-prone when it was removed from ChatGPT in February 2026. === Removal with GPT-5 === On August 7, 2025, OpenAI released GPT-5. Its release was criticized as, with it, legacy GPT models were no longer available via ChatGPT, including GPT-4o, except for Pro users. Some users were particularly frustrated over this removal without prior warning because they used different GPT models for distinct purposes and found that GPT-5's router system left them with less control. In addition, some users preferred GPT-4o's warmer and more personal tone over that of GPT-5, which they described as "flat", "uncreative" and "lobotomized", and resembling an "overworked secretary". As a response, in a post on X, Sam Altman said that OpenAI would bring back the option to select GPT-4o to Plus users as well, and "[w]e [OpenAI] will watch usage as we think about how long to offer legacy models for." He also stated: "We for sure underestimated how much some of the things that people like in GPT-4o matter to them, even if GPT-5 performs better in most ways". "Long-term, this has reinforced that we really need good ways for different users to customize things (we understand that there isn't one model that works for everyone, and we have been investing in steerability research and launched a research preview of different personalities)". On August 13, 2025, Altman wrote on X that OpenAI is working on GPT-5's personality to make the model "feel warmer". The model was removed from ChatGPT on February 13, 2026. This caused new backlash from users that had grown attached to its personality and felt its creative writing abilities and understanding of nuance were irreplaceable. On social media, some users launched the movement "#Keep4o". A research paper highlighted the plea "Please, don’t kill the only model that still feels human". The model was removed the day before Valentine's Day, and some users had romantic relationships with GPT-4o.

    Read more →
  • Blackmagic Design

    Blackmagic Design

    Blackmagic Design Pty Ltd is an Australian company that develops digital cinema technology and manufactures professional video production hardware and software. Headquartered in South Melbourne, it is known for producing high-end digital movie cameras and a range of broadcast and post-production equipment. The company also develops software applications, including the DaVinci Resolve application for non-linear video editing, color correction, color grading, visual effects, and audio post-production. == History == Blackmagic Design Pty Ltd was founded on 7 September 2001 by Grant Petty. Its first product, DeckLink, introduced in 2002, was a video capture card for macOS that supported uncompressed 10-bit video, marking a shift toward professional-grade yet affordable video workflows. Subsequent versions—including the DeckLink 2, Pro SDI, HD Plus, and Multibridge—added capabilities such as color correction, Windows support, and compatibility with major editing software like Adobe Premiere Pro, to broaden the product's appeal. At the 2012 NAB Show, Blackmagic announced its first Cinema Camera, a digital movie camera. Blackmagic made several acquisitions over the next decade. In 2009, it acquired da Vinci Systems, known for its color-grading tools. In 2010, it acquired Echolab's ATEM switcher line, in 2014, it added eyeon Software (developer of the Blackmagic Fusion compositing software) and London's Cintel (film scanning and restoration), and in 2016, it acquired Fairlight, an audio technology company known for its CMI synthesizers as well as mixing consoles. == Products == List of all products developed by the company. Editing, Color Correction and Audio Post Production DaVinci Resolve (free version) and DaVinci Resolve Studio (paid version), computer software for non-linear video editing, color correction, color grading, visual effects, and audio post-production. Audio/Video Controller Consoles: Editor Keyboard, Speed Editor, DaVinci Resolve Replay Editor, Micro Panel, Mini Panel, DaVinci Resolve Micro Color Panel, Advanced Panel, Fairlight Console Channel Fader, Fairlight Console Channel Control, Fairlight Console LCD Monitor, Fairlight Console Audio Editor, Fairlight Desktop Audio Editor, Fairlight Desktop Console, Fairlight Audio Interface Cintel Film Scanner (Generations 1-3) Live Production Home Streaming: ATEM Mini, ATEM Mini Pro/ISO, ATEM Mini Extreme, ATEM Mini Extreme ISO (The ATEM Mini series has both HDMI and SDI variants) Production Switchers: ATEM 1,2 & 4 M/E Constellation HD, ATEM 1,2 & 4 M/E Constellation 4K, ATEM Constellation 8K, ATEM 1,2 & 4 M/E Production Studio 4K, ATEM Television Studio HD8 & HD8 ISO Switcher & Camera Controllers: ATEM Camera Control Panel, ATEM 1 M/E Advanced Panel, ATEM 2 M/E Advanced Panel, ATEM 4 M/E Advanced Panel Chroma Keyers: Ultimatte 12 HD Mini, Ultimatte 12 HD, Ultimatte 12 4K, Ultimatte 12 8K Recording and Storage: HyperDeck Studio HD Mini, HyperDeck Studio HD Plus, HyperDeck Studio HD Plus, HyperDeck Studio 4K Pro, HyperDeck Extreme 8K HDR, HyperDeck Extreme 4K HDR, HyperDeck Extreme Control, HyperDeck Shuttle HD, Duplicator 4K, MultiDock 10G, Video Assist 7" 12G HDR, Video Assist 5" 12G HDR Capture and Playback UltraStudio: 3G, HD Mini, 4K Mini, 4K Extreme 3 DeckLink (PCIe cards): Mini Recorder, Mini Monitor, Mini Monitor 4K, Mini Recorder 4K, Duo 2 Mini, Duo 2, Quad 2, SDI 4K, Studio 4K, 4K Extreme 12G, 8K Pro, Quad HDMI Recorder Network Storage Cloud Store Cloud Pod Broadcast Converters Micro Converter: BiDirectional SDI/HDMI 3G wPSU, HDMI to SDI 3G wPSU, SDI to HDMI 3G wPSU, BiDirectional SDI/HDMI 3G, HDMI to SDI 3G, SDI to HDMI 3G Mini Converters: Audio to SDI, Optical Fiber 12G, SDI Multiplex 4K, Quad SDI to HDMI 4K, SDI Distribution 4K, SDI to Analog 4K, Audio to SDI 4K, SDI to Audio 4K, HDMI to SDI 6G, SDI to HDMI 6G Teranex Mini: SDI Distribution 12G, SDI to HDMI 12G, Audio to SDI 12G, SDI to Analog 12G, SDI to HDMI 8K HDR, SDI to DisplayPort 8K HDR 2110 IP Converters Routing and Distribution Videohub

    Read more →
  • DryvIQ

    DryvIQ

    DryvIQ is a software application that enables businesses to migrate on-site system files and associated data across storage and content management platforms, as well as create synchronized hybrid storage systems. == History == Before it was DryvIQ, the software SkySync was released in 2013 by Ann Arbor, Michigan based company, Portal Architects, Inc. The company created SkySync, a back-end, administrative application designed to transfer content across storage platforms, after abandoning 18 months of development on a desktop application called SkyBrary in 2011. Between 2014 and 2015, Portal Architects established partnerships with the following companies: Autodesk, Box, Dropbox, Egnyte, EMC, Google, Syncplicity, Huddle, IBM, Microsoft, OpenText, Oracle, Citrix ShareFile, Hightail and Internet2. SkySync (currently DryvIQ) was named a "Cool Vendor in Content Management" by Gartner in 2015. In 2022, SkySync changed its name to DryvIQ, which is now what the company is currently known as. == Overview == DryvIQ is a software application that syncs, migrates or backs up files including their associated properties, metadata, versions, user accounts and permissions across on-premises and Cloud-based storage platforms. The software deploys on a server, virtual machine or within Microsoft Azure, Amazon Web Services or other cloud computing services.

    Read more →
  • Visual descriptor

    Visual descriptor

    In computer vision, visual descriptors or image descriptors are descriptions of the visual features of the contents in images, videos, or algorithms or applications that produce such descriptions. They describe elementary characteristics such as the shape, the color, the texture or the motion, among others. == Introduction == As a result of the new communication technologies and the massive use of Internet in our society, the amount of audio-visual information available in digital format is increasing considerably. Therefore, it has been necessary to design some systems that allow us to describe the content of several types of multimedia information in order to search and classify them. The audio-visual descriptors are in charge of the contents description. These descriptors have a good knowledge of the objects and events found in a video, image or audio and they allow the quick and efficient searches of the audio-visual content. This system can be compared to the search engines for textual contents. Although it is relatively easy to find text with a computer, it is much more difficult to find concrete audio and video parts. For instance, imagine somebody searching a scene of a happy person. The happiness is a feeling and it is not evident its shape, color and texture description in images. The description of the audio-visual content is not a superficial task and it is essential for the effective use of this type of archives. The standardization system that deals with audio-visual descriptors is the MPEG-7 (Motion Picture Expert Group - 7). == Types == Descriptors are the first step to find out the connection between pixels contained in a digital image and what humans recall after having observed an image or a group of images after some minutes. Visual descriptors are divided in two main groups: General information descriptors: contain low level descriptors which give a description about color, shape, regions, textures and motion. Specific domain information descriptors: give information about objects and events in the scene. A concrete example would be face recognition. === General information descriptors === General information descriptors consist of a set of descriptors that covers different basic and elementary features like: color, texture, shape, motion, location and others. This description is automatically generated by means of signal processing. ==== Color ==== It's the most basic quality of visual content. Five tools are defined to describe color. The three first tools represent the color distribution and the last ones describe the color relation between sequences or group of images: Dominant color descriptor (DCD) Scalable color descriptor (SCD) Color structure descriptor (CSD) Color layout descriptor (CLD) Group of frame (GoF) or group-of-pictures (GoP) ==== Texture ==== It's an important quality in order to describe an image. The texture descriptors characterize image textures or regions. They observe the region homogeneity and the histograms of these region borders. The set of descriptors is formed by: Homogeneous texture descriptor (HTD) Texture browsing descriptor (TBD) Edge histogram descriptor (EHD) ==== Shape ==== It contains important semantic information due to human's ability to recognize objects through their shape. However, this information can only be extracted by means of a segmentation similar to the one that the human visual system implements. Nowadays, such a segmentation system is not available yet, however there exists a serial of algorithms which are considered to be a good approximation. These descriptors describe regions, contours and shapes for 2D images and for 3D volumes. The shape descriptors are the following ones: Region-based shape descriptor (RSD) Contour-based shape descriptor (CSD) 3-D shape descriptor (3-D SD) ==== Motion ==== It's defined by four different descriptors which describe motion in video sequence. Motion is related to the objects motion in the sequence and to the camera motion. This last information is provided by the capture device, whereas the rest is implemented by means of image processing. The descriptor set is the following one: Motion activity descriptor (MAD) Camera motion descriptor (CMD) Motion trajectory descriptor (MTD) Warping and parametric motion descriptor (WMD and PMD) ==== Location ==== Elements location in the image is used to describe elements in the spatial domain. In addition, elements can also be located in the temporal domain: Region locator descriptor (RLD) Spatio temporal locator descriptor (STLD) === Specific domain information descriptors === These descriptors, which give information about objects and events in the scene, are not easily extractable, even more when the extraction is to be automatically done. Nevertheless, they can be manually processed. As mentioned before, face recognition is a concrete example of an application that tries to automatically obtain this information. == Descriptors applications == Among all applications, the most important ones are: Multimedia documents search engines and classifiers. Digital library: visual descriptors allow a very detailed and concrete search of any video or image by means of different search parameters. For instance, the search of films where a known actor appears, the search of videos containing the Everest mountain, etc. Personalized electronic news service. Possibility of an automatic connection to a TV channel broadcasting a soccer match, for example, whenever a player approaches the goal area. Control and filtering of concrete audiovisual content, like violent or pornographic material. Also, authorization for some multimedia content.

    Read more →