AI Email Message Generator

AI Email Message Generator — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • RFinder

    RFinder

    RFinder ("repeater finder") is a subscription-based website and mobile app. RFinder's main service is the World Wide Repeater Directory (WWRD), which is a directory of amateur radio repeaters. RFinder is the official repeater directory of several amateur radio associations. RFinder has listings for several amateur radio modes, including FM, D-STAR, DMR, and ATV. == World Wide Repeater Directory == Repeaters are listed in the directory along with its call sign, Maidenhead Locator System and GPS coordinates, transmit/receive offset ("split"), CTCSS and DCS squelch settings, and VoIP settings (IRLP and Echolink nodes). The directory has over 50,000 repeater listings in over 170 countries. === Website === The RFinder website has several search options including for routes. === Forums === RFinder user forums is for help and support for the app and hardware. === Mobile app === RFinder has mobile apps for Android and iOS. When using the mobile app, RFinder can display the distance to repeaters, based on the mobile device's current location. === ARRL Repeater Directory === The ARRL publishes the ARRL Repeater Directory which contains over 31,000 repeater listings for the US and Canada with listings provided by RFinder. == Subscription == RFinder requires a subscription. A one-year subscription is US$12.99. == Radio programming software == Some radio programming software applications can query RFinder and download repeater listing to program radios. Compatible software includes: CHIRP RT Systems == Radio associations == RFinder is the official repeater directory of the following associations: Amateur Radio Society Italy American Radio Relay League Cayman Amateur Radio Society Deutscher Amateur Radio Club Federacion Mexicana de Radio Experimentadores L’association Réseau des Émetteurs Français Lietuvos Radijo Mėgėjų Draugija Liga de Amadores Brasilieros de Radio Emissão Radio Amateurs of Canada Radio Society of Great Britain Rede dos Emissores Portugueses Unión de Radioaficionados Españoles

    Read more →
  • Scale space

    Scale space

    Scale-space theory is a framework for multi-scale signal representation developed by the computer vision, image processing and signal processing communities with complementary motivations from physics and biological vision. It is a formal theory for handling image structures at different scales, by representing an image as a one-parameter family of smoothed images, the scale-space representation, parametrized by the size of the smoothing kernel used for suppressing fine-scale structures. The parameter t {\displaystyle t} in this family is referred to as the scale parameter, with the interpretation that image structures of spatial size smaller than about t {\displaystyle {\sqrt {t}}} have largely been smoothed away in the scale-space level at scale t {\displaystyle t} . The main type of scale space is the linear (Gaussian) scale space, which has wide applicability as well as the attractive property of being possible to derive from a small set of scale-space axioms. The corresponding scale-space framework encompasses a theory for Gaussian derivative operators, which can be used as a basis for expressing a large class of visual operations for computerized systems that process visual information. This framework also allows visual operations to be made scale invariant, which is necessary for dealing with the size variations that may occur in image data, because real-world objects may be of different sizes and in addition the distance between the object and the camera may be unknown and may vary depending on the circumstances. == Definition == The notion of scale space applies to signals of arbitrary numbers of variables. The most common case in the literature applies to two-dimensional images, which is what is presented here. Consider a given image f {\displaystyle f} where f ( x , y ) {\displaystyle f(x,y)} is the greyscale value of the pixel at position ( x , y ) {\displaystyle (x,y)} . The linear (Gaussian) scale-space representation of f {\displaystyle f} is a family of derived signals L ( x , y ; t ) {\displaystyle L(x,y;t)} defined by the convolution of f ( x , y ) {\displaystyle f(x,y)} with the two-dimensional Gaussian kernel g ( x , y ; t ) = 1 2 π t e − ( x 2 + y 2 ) / 2 t {\displaystyle g(x,y;t)={\frac {1}{2\pi t}}e^{-(x^{2}+y^{2})/2t}\,} such that L ( ⋅ , ⋅ ; t ) = g ( ⋅ , ⋅ ; t ) ∗ f ( ⋅ , ⋅ ) , {\displaystyle L(\cdot ,\cdot ;t)\ =g(\cdot ,\cdot ;t)f(\cdot ,\cdot ),} where the semicolon in the argument of L {\displaystyle L} implies that the convolution is performed only over the variables x , y {\displaystyle x,y} , while the scale parameter t {\displaystyle t} after the semicolon just indicates which scale level is being defined. This definition of L {\displaystyle L} works for a continuum of scales t ≥ 0 {\displaystyle t\geq 0} , but typically only a finite discrete set of levels in the scale-space representation would be actually considered. The scale parameter t = σ 2 {\displaystyle t=\sigma ^{2}} is the variance of the Gaussian filter and as a limit for t = 0 {\displaystyle t=0} the filter g {\displaystyle g} becomes an impulse function such that L ( x , y ; 0 ) = f ( x , y ) , {\displaystyle L(x,y;0)=f(x,y),} that is, the scale-space representation at scale level t = 0 {\displaystyle t=0} is the image f {\displaystyle f} itself. As t {\displaystyle t} increases, L {\displaystyle L} is the result of smoothing f {\displaystyle f} with a larger and larger filter, thereby removing more and more of the details that the image contains. Since the standard deviation of the filter is σ = t {\displaystyle \sigma ={\sqrt {t}}} , details that are significantly smaller than this value are to a large extent removed from the image at scale parameter t {\displaystyle t} , see the following figures and for graphical illustrations. === Why a Gaussian filter? === When faced with the task of generating a multi-scale representation one may ask: could any filter g of low-pass type and with a parameter t which determines its width be used to generate a scale space? The answer is no, as it is of crucial importance that the smoothing filter does not introduce new spurious structures at coarse scales that do not correspond to simplifications of corresponding structures at finer scales. In the scale-space literature, a number of different ways have been expressed to formulate this criterion in precise mathematical terms. The conclusion from several different axiomatic derivations that have been presented is that the Gaussian scale space constitutes the canonical way to generate a linear scale space, based on the essential requirement that new structures must not be created when going from a fine scale to any coarser scale. Conditions, referred to as scale-space axioms, that have been used for deriving the uniqueness of the Gaussian kernel include linearity, shift invariance, semi-group structure, non-enhancement of local extrema, scale invariance and rotational invariance. In the works, the uniqueness claimed in the arguments based on scale invariance has been criticized, and alternative self-similar scale-space kernels have been proposed. The Gaussian kernel is, however, a unique choice according to the scale-space axiomatics based on causality or non-enhancement of local extrema. === Alternative definition === Equivalently, the scale-space family can be defined as the solution of the diffusion equation (for example in terms of the heat equation), ∂ t L = 1 2 ∇ 2 L , {\displaystyle \partial _{t}L={\frac {1}{2}}\nabla ^{2}L,} with initial condition L ( x , y ; 0 ) = f ( x , y ) {\displaystyle L(x,y;0)=f(x,y)} . This formulation of the scale-space representation L means that it is possible to interpret the intensity values of the image f as a "temperature distribution" in the image plane and that the process that generates the scale-space representation as a function of t corresponds to heat diffusion in the image plane over time t (assuming the thermal conductivity of the material equal to the arbitrarily chosen constant ⁠1/2⁠). Although this connection may appear superficial for a reader not familiar with differential equations, it is indeed the case that the main scale-space formulation in terms of non-enhancement of local extrema is expressed in terms of a sign condition on partial derivatives in the 2+1-D volume generated by the scale space, thus within the framework of partial differential equations. Furthermore, a detailed analysis of the discrete case shows that the diffusion equation provides a unifying link between continuous and discrete scale spaces, which also generalizes to nonlinear scale spaces, for example, using anisotropic diffusion. Hence, one may say that the primary way to generate a scale space is by the diffusion equation, and that the Gaussian kernel arises as the Green's function of this specific partial differential equation. == Motivations == The motivation for generating a scale-space representation of a given data set originates from the basic observation that real-world objects are composed of different structures at different scales. This implies that real-world objects, in contrast to idealized mathematical entities such as points or lines, may appear in different ways depending on the scale of observation. For example, the concept of a "tree" is appropriate at the scale of meters, while concepts such as leaves and molecules are more appropriate at finer scales. For a computer vision system analysing an unknown scene, there is no way to know a priori what scales are appropriate for describing the interesting structures in the image data. Hence, the only reasonable approach is to consider descriptions at multiple scales in order to be able to capture the unknown scale variations that may occur. Taken to the limit, a scale-space representation considers representations at all scales. Another motivation to the scale-space concept originates from the process of performing a physical measurement on real-world data. In order to extract any information from a measurement process, one has to apply operators of non-infinitesimal size to the data. In many branches of computer science and applied mathematics, the size of the measurement operator is disregarded in the theoretical modelling of a problem. The scale-space theory on the other hand explicitly incorporates the need for a non-infinitesimal size of the image operators as an integral part of any measurement as well as any other operation that depends on a real-world measurement. There is a close link between scale-space theory and biological vision. Many scale-space operations show a high degree of similarity with receptive field profiles recorded from the mammalian retina and the first stages in the visual cortex. In these respects, the scale-space framework can be seen as a theoretically well-founded paradigm for early vision, which in addition has been thoroughly tested by algorithms and experiments. == Gaussian derivatives == At any scale in scale space, we c

    Read more →
  • Underwater computer vision

    Underwater computer vision

    Underwater computer vision is a subfield of computer vision. In recent years, with the development of underwater vehicles ( ROV, AUV, gliders), the need to be able to record and process huge amounts of information has become increasingly important. Applications range from inspection of underwater structures for the offshore industry to the identification and counting of fishes for biological research. However, no matter how big the impact of this technology can be to industry and research, it still is in a very early stage of development compared to traditional computer vision. One reason for this is that, the moment the camera goes into the water, a whole new set of challenges appear. On one hand, cameras have to be made waterproof, marine corrosion deteriorates materials quickly and access and modifications to experimental setups are costly, both in time and resources. On the other hand, the physical properties of the water make light behave differently, changing the appearance of a same object with variations of depth, organic material, currents, temperature etc. == Applications == Seafloor survey Vehicle navigation and positioning Biological monitoring {possibly aquatic biomonitoring) Video mosaics as visual navigation maps Submarine pipeline inspection Wreckage visualization Maintenance of underwater structures Drowning detection systems == Medium differences == === Illumination === In air, light comes from the whole hemisphere on cloudy days, and is dominated by the sun. In water direct lighting comes from a cone about 96° wide above the scene. This phenomenon is called Snell's window. Artificial lighting can be used where natural light levels are insufficient and where the light path is too long to produce acceptable colour, as the loss of colour is a function of the total distance through water from the source to the camera lens port. === Light attenuation === Unlike air, water attenuates light exponentially. This results in hazy images with very low contrast. The main reasons for light attenuation are light absorption (where energy is removed from the light) and light scattering, by which the direction of light is changed. Light scattering can further be divided into forward scattering, which results in an increased blurriness and backward scattering that limits the contrast and is responsible for the characteristic veil of underwater images. Both scattering and attenuation are heavily influenced by the amount of organic matter dissolved or suspended in the water. Light attenuation in water is also a function of the wavelength. This means that different colours are attenuated at different rates, leading to colour degradation.with depth and distance. Red and orange light are attenuated faster, followed by yellows and greens. Blue is the least attenuated visible wavelength. === Artificial lighting === == Challenges == In high level computer vision, human structures are frequently used as image features for image matching in different applications. However, the sea bottom lacks such features, making it hard to find correspondences in two images. In order to be able to use a camera in the water, a watertight housing is required. However, refraction will happen at the water-glass and glass-air interface due to differences in density of the materials. This has the effect of introducing a non-linear image deformation. The motion of the vehicle presents another special challenge. Underwater vehicles are constantly moving due to currents and other phenomena. This introduces another uncertainty to algorithms, where small motions may appear in all directions. This can be specially important for video tracking. In order to reduce this problem image stabilization algorithms may be applied. == Relevant technology == === Image restoration === Image restoration< techniques are intended to model the degradation process and then invert it, obtaining the new image after solving. It is generally a complex approach that requires plenty of parameters that vary a lot between different water conditions. === Image enhancement === Image enhancement only tries to provide a visually more appealing image without taking the physical image formation process into account. These methods are usually simpler and less computational intensive. === Color correction === Various algorithms exist that perform automatic color correction. The UCM (Unsupervised Color Correction Method), for example, does this in the following steps: It firstly reduces the color cast by equalizing the color values. Then it enhances contrast by stretching the red histogram towards the maximum and finally saturation and intensity components are optimized. == Underwater stereo vision == It is usually assumed that stereo cameras have been calibrated previously, geometrically and radiometrically. This leads to the assumption that corresponding pixels should have the same color. However this can not be guaranteed in an underwater scene, because of dispersion and backscatter. However, it is possible to digitally model this phenomenon and create a virtual image with those effects removed == Other application fields == Imaging sonars have become more and more accessible and gained resolution, delivering better images. Sidescan sonars are used to produce complete maps of regions of the sea floor stitching together sequences of sonar images. However, sonar images often lack proper contrast and are degraded by artefacts and distortions due to noise, attitude changes of the AUV/ROV carrying the sonar or non uniform beam patterns. Another common problem with sonar computer vision is the comparatively low frame rate of sonar images.

    Read more →
  • Text simplification

    Text simplification

    Text simplification is an aspect of natural language processing that involves modifying, organizing, or categorizing existing text to make it easier to understand while retaining its original meaning. This process is essential in today's world, where communication is increasingly complex due to advancements in science, technology, and media. Human languages are inherently intricate, with extensive vocabularies and complex structures that can be challenging for machines to handle efficiently. Researchers have found that semantic compression techniques can help streamline and simplify text by reducing linguistic diversity and simplifying the vocabulary used in a given context. == Example == Text simplification involves modifying complex sentences into simpler ones to enhance readability and comprehension. Siddharthan (2006) provides an example to illustrate this process. The original sentence contains multiple clauses and phrases, which can be broken down into simpler sentences for better understanding. Also contributing to the firmness in copper, the analyst noted, was a report by Chicago purchasing agents, which precedes the full purchasing agents report that is due out today and gives an indication of what the full report might hold. Also contributing to the firmness in copper, the analyst noted, was a report by Chicago purchasing agents. The Chicago report precedes the full purchasing agents report. The Chicago report gives an indication of what the full report might hold. The full report is due out today. An approach to text simplification involves lexical simplification via lexical substitution, a process that replaces complex words with simpler synonyms. Identifying complex words is a challenge addressed by machine learning classifiers trained on labeled data. Researchers have found that asking labelers to sort words by complexity levels yields more consistent results than the traditional method of categorizing words as simple or complex.

    Read more →
  • Avizo (software)

    Avizo (software)

    Avizo (pronounce: 'a-VEE-zo') is a general-purpose commercial software application for scientific and industrial data visualization and analysis. Avizo is developed by Thermo Fisher Scientific and was originally designed and developed by the Visualization and Data Analysis Group at Zuse Institute Berlin (ZIB) under the name Amira. Avizo was commercially released in November 2007. For the history of its development, see the Wikipedia article about Amira. == Overview == Avizo is a software application which enables users to perform interactive visualization and computation on 3D data sets. The Avizo interface is modelled on the visual programming. Users manipulate data and module components, organized in an interactive graph representation (called Pool), or in a Tree view. Data and modules can be interactively connected together, and controlled with several parameters, creating a visual processing network whose output is displayed in a 3D viewer. With this interface, complex data can be interactively explored and analyzed by applying a controlled sequence of computation and display processes resulting in a meaningful visual representation and associated derived data. == Application areas == Avizo has been designed to support different types of applications and workflows from 2D and 3D image data processing to simulations. It is a versatile and customizable visualization tool used in many fields: Scientific visualization Materials Research Tomography, Microscopy, etc. Nondestructive testing, Industrial Inspection, and Visual Inspection Computer-aided Engineering and simulation data post-processing Porous medium analysis Civil Engineering Seismic Exploration, Reservoir Engineering, Microseismic Monitoring, Borehole Imaging Geology, Digital Rock Physics (DRP), Earth Sciences Archaeology Food technology and agricultural science Physics, Chemistry Climatology, Oceanography, Environmental Studies Astrophysics == Features == Data import: 2D and 3D image stack and volume data: from microscopes (electron, optical), X-ray tomography (CT, micro-/nano-CT, synchrotron), neutron tomography and other acquisition devices (MRI, radiography, GPR) Geometric models (such as point sets, line sets, surfaces, grids) Numerical simulation data (such as Computational fluid dynamics or Finite element analysis data) Molecular data Time series and animations Seismic data Well logs 4D Multivariate Climate Models 2D/3D data visualization: Volume rendering Digital Volume Correlation Visualization of sections, through various slicing and clipping methods Isosurface rendering Polygonal meshes Scalar fields, Vector fields, Tensor representations, Flow visualization (Illuminated Streamlines, Stream Ribbons) Image processing: 2D/3D Alignment of image slices, Image registration Image filtering Mathematical Morphology (erode, dilate, open, close, tophat) Watershed Transform, Distance Transform Image segmentation 3D models reconstruction: Polygonal surface generation from segmented objects Generation of tetrahedral grids Surface reconstruction from point clouds Skeletonization (reconstruction of dendritic, porous or fracture network) Surface model simplification Quantification and analysis: Measurements and statistics Analysis spreadsheet and charting Material properties computation, based on 3D images: Absolute permeability Thermal conductivity Molecular diffusivity Electrical resistivity/formation factor 3D image-based meshing for CFD and FEA: From 3D imaging modalities (CT, micro-CT, MRI, etc.) Surface and volume meshes generation Export to FEA and CFD solvers for simulation Post-processing for simulation analysis Presentation, automation: MovieMaker, Multiscreen, Video wall, collaboration, and VR support TCL Scripting, C++ extension API Avizo is based on Open Inventor 3D graphics toolkits (FEI Visualization Sciences Group).

    Read more →
  • OrCam device

    OrCam device

    OrCam devices such as OrCam MyEye are portable, artificial vision devices that allow visually impaired people to understand text and identify objects through audio feedback, describing what they are unable to see. Reuters described an important part of how it works as "a wireless smartcamera" which, when attached outside eyeglass frames, can read and verbalize text, and also supermarket barcodes. This information is converted to spoken words and entered "into the user’s ear." Face-recognition is also part of OrCam's feature set. == Devices == OrCam Technologies Ltd has created three devices; OrCam MyEye 2.0, OrCam MyEye 1, and OrCam MyReader. OrCam My Eye 2.0: OrCam debuted the second-generation model, the OrCam MyEye 2.0 in December 2017. About the size of a finger, the MyEye 2.0 is battery-powered, and has been compressed into a self-contained device. The device snaps onto any eyeglass frame magnetically. Orcam 2.0 is small and light (22.5 grams/0.8 ounces) with functionality to restore independence to the visually impaired. It comes in two versions. The basic model can read text, and a more advanced one adds features such as face recognition and barcode reading. As of July 2023, the retail cost is between $4000 and $6000 (USD). == Clinical Studies == JAMA Ophthalmology: In 2016 JAMA Ophthalmology conducted a study involving 12 legally blind participants to evaluate the usefulness of a portable artificial vision device (OrCam) for patients with low vision. The results showed that the OrCam device improved the patient's ability to perform tasks simulating those of daily living, such as reading a message on an electronic device, a newspaper article or a menu. Wills Eye: Wills Eye was a clinical study designed to measure the impact of the OrCam device on the quality of life of patients with End-stage Glaucoma. The conclusion was that OrCam, a novel artificial vision device using a mini-camera mounted on eyeglasses, allowed legally blind patients with end-stage glaucoma to read independently, subsequently improving their quality of life. == Employee testing == The New York Times described how a pre-release OrCam device was used by a Coloboma-impaired employee of the device's developer in 2013 for grocery shopping. It was the small size of the prototype rather than the functionality that gave her added mobility in an Israeli store's aisles. Added life-enhancement was described: "to both recognize and speak .. bus numbers .. traffic lights." == Social aspects == In contrast to an early version of Google Glass, which "failed ... because .. Glass wearers were ..mocked", early OrCam devices used designs that "clip unobtrusively on your shirt or perhaps your belt." In addition, it does not record sounds or images, what was called "the privacy puzzle that stumped Google. One 2018 technology reviewer wrote that he wished it had a headphone jack "so it would be less disruptive in places where others are working." An attempt was made to use bone conduction. == USA introduction == In 2018 a team headed by New York Assemblyman Dov Hikind introduced use of OrCam devices to ten individuals screened for what he termed "new Israeli technology that really makes a difference to the blind." Although not the first USA success, it was more focused than a publicly funded project that was authorized in 2016 by a California government agency. Also in 2016 the Chicago Lighthouse for the Blind demonstrated its use. == Technology == In the area of hardware, miniaturization has been quite important, but one major area, software, was mentioned by Assemblyman Hikind, and reported by The Times of Israel is the "AI-driven algorithms" that "reports .. how many people are in a room. In addition to reading printed text, it can also aid in "seeing" what is on a television or computer screen. Although OrCam can't help with handwritten information, it can reuse information, the basis of recognizing "US currency, and even faces." === Features === While early language support was for English, French, German, Hebrew and Spanish, others now available include Danish, Dutch, Finnish, Italian, Norwegian, Portuguese and Swedish. == History == OrCam Technologies Ltd was founded in 2010 by Professor Amnon Shashua and Ziv Aviram. Before co-founding OrCam, the two in 1999 co-founded Mobileye, an Israeli company that develops vision-based advanced driver-assistance systems (ADAS) providing warnings for collision prevention and mitigation, which was acquired by Intel for $15.3 billion in 2017. OrCam launched OrCam MyEye in 2013 after years of development and testing, and began selling it commercially in 2015. In its early years, the company raised $22 million, $6 million of which came from Intel Capital. By 2014, Intel, which was also investing in Google Glass, had invested $15 million in Orcam. In March 2017, OrCam had raised $41 million in capital, making it worth $600 million. === Marketing === One outcome of initial marketing in the USA was that they "reached a deal with the California Department of Rehabilitation, ...qualifying blind and visually impaired state residents." == OrCam Technologies Ltd == OrCam Technologies Ltd. is the Israeli-based company producing these OrCam devices, which are wearable artificial intelligence space. The company develops and manufactures assistive technology devices for individuals who are visually impaired, partially sighted, blind, print disabilities, or have other disabilities. OrCam headquarters is located in Jerusalem, operating under the company name OrCam Technologies Ltd. OrCam has over 150 employees, is headquartered in Jerusalem, and has offices in New York, Toronto, and London. == Awards == 2018 Last Gadget Standing Winner 2018 CES Innovation Awards Honoree in Accessible Tech 2017 NAIDEX Innovation Award 2016 Louise Braille Corporate Recognition Award 2016 Silmo-d-Or Award

    Read more →
  • Pyramid (image processing)

    Pyramid (image processing)

    Pyramid, or pyramid representation, is a type of multi-scale signal representation developed by the computer vision, image processing and signal processing communities, in which a signal or an image is subject to repeated smoothing and subsampling. Pyramid representation is a predecessor to scale-space representation and multiresolution analysis. == Pyramid generation == There are two main types of pyramids: lowpass and bandpass. A lowpass pyramid is made by smoothing the image with an appropriate smoothing filter and then subsampling the smoothed image, usually by a factor of 2 along each coordinate direction. The resulting image is then subjected to the same procedure, and the cycle is repeated multiple times. Each cycle of this process results in a smaller image with increased smoothing, but with decreased spatial sampling density (that is, decreased image resolution). If illustrated graphically, the entire multi-scale representation will look like a pyramid, with the original image on the bottom and each cycle's resulting smaller image stacked one atop the other. A bandpass pyramid is made by forming the difference between images at adjacent levels in the pyramid and performing image interpolation between adjacent levels of resolution, to enable computation of pixelwise differences. == Pyramid generation kernels == A variety of different smoothing kernels have been proposed for generating pyramids. Among the suggestions that have been given, the binomial kernels arising from the binomial coefficients stand out as a particularly useful and theoretically well-founded class. Thus, given a two-dimensional image, we may apply the (normalized) binomial filter (1/4, 1/2, 1/4) typically twice or more along each spatial dimension and then subsample the image by a factor of two. This operation may then proceed as many times as desired, leading to a compact and efficient multi-scale representation. If motivated by specific requirements, intermediate scale levels may also be generated where the subsampling stage is sometimes left out, leading to an oversampled or hybrid pyramid. With the increasing computational efficiency of CPUs available today, it is in some situations also feasible to use wider supported Gaussian filters as smoothing kernels in the pyramid generation steps. === Gaussian pyramid === In a Gaussian pyramid, subsequent images are weighted down using a Gaussian average (Gaussian blur) and scaled down. Each pixel containing a local average corresponds to a neighborhood pixel on a lower level of the pyramid. This technique is used especially in texture synthesis. === Laplacian pyramid === A Laplacian pyramid is very similar to a Gaussian pyramid but saves the difference image of the blurred versions between each levels. Only the smallest level is not a difference image to enable reconstruction of the high resolution image using the difference images on higher levels. This technique can be used in image compression. === Steerable pyramid === A steerable pyramid, developed by Simoncelli and others, is an implementation of a multi-scale, multi-orientation band-pass filter bank used for applications including image compression, texture synthesis, and object recognition. It can be thought of as an orientation selective version of a Laplacian pyramid, in which a bank of steerable filters are used at each level of the pyramid instead of a single Laplacian or Gaussian filter. == Applications of pyramids == === Alternative representation === In the early days of computer vision, pyramids were used as the main type of multi-scale representation for computing multi-scale image features from real-world image data. More recent techniques include scale-space representation, which has been popular among some researchers due to its theoretical foundation, the ability to decouple the subsampling stage from the multi-scale representation, the more powerful tools for theoretical analysis as well as the ability to compute a representation at any desired scale, thus avoiding the algorithmic problems of relating image representations at different resolution. Nevertheless, pyramids are still frequently used for expressing computationally efficient approximations to scale-space representation. === Detail manipulation === Levels of a Laplacian pyramid can be added to or removed from the original image to amplify or reduce detail at different scales. However, detail manipulation of this form is known to produce halo artifacts in many cases, leading to the development of alternatives such as the bilateral filter. Some image compression file formats use the Adam7 algorithm or some other interlacing technique. These can be seen as a kind of image pyramid. Because those file format store the "large-scale" features first, and fine-grain details later in the file, a particular viewer displaying a small "thumbnail" or on a small screen can quickly download just enough of the image to display it in the available pixels—so one file can support many viewer resolutions, rather than having to store or generate a different file for each resolution.

    Read more →
  • Pandorabots

    Pandorabots

    Pandorabots, Inc. is an artificial intelligence company that runs a web service for building and deploying chatbots. Pandorabots implements and supports development of the Artificial Intelligence Markup Language and makes portions of its code accessible for free. The Pandorabots Platform is "one of the oldest and largest chatbot hosting services in the world", allowing creation of virtual agents to hold human-like text or voice chats with consumers. The platform is written in Allegro Common LISP. == Use Cases == Common use cases include advertising, virtual assistance, e-learning, entertainment and education. The platform has also been used by academics and universities use the platform for teaching and research.

    Read more →
  • Xiaoice

    Xiaoice

    Xiaoice (Chinese: 微软小冰; pinyin: Wēiruǎn Xiǎobīng; lit. 'Microsoft Little Ice', IPA [wéɪɻwânɕjâʊpíŋ]) is an AI system developed by Microsoft (Asia) Software Technology Center (STCA) in 2014 based on an emotional computing framework. In July 2018, Microsoft Xiaoice released the 6th generation. Xiaoice Company, formerly known as AI Xiaoice Team of Microsoft Software Technology Center Asia, was Microsoft's largest independent R&D team for AI products. Founded in China in December 2013 with an expanded Japanese R&D team established in September 2014, this team is distributed in Beijing, Suzhou, and Tokyo, etc. with its technical products covering Asia. On 13 July 2020, Microsoft spun off its Xiaoice business into a separate company. As of 2021, the AI chatbots created and hosted by the Xiaoice framework accounted for about 60% of total global AI interactions. == Platforms, languages and countries == Xiaoice exists on more than 40 platforms in four countries (China, Japan, USA and Indonesia) including apps such as WeChat, QQ, Weibo and Meipai in China, and Facebook Messenger in USA and LINE in Japan. == Introduction == On 13 July 2020, Microsoft spun off its Xiaoice business into a separate company, aiming at enabling the Xiaoice product line to accelerate the pace of local innovation and commercialization, and appointed Dr. Harry Shum, former global executive VP of Microsoft, as the chairman of the new company, Li Di, Microsoft Partner of Products in Microsoft STCA, as the CEO, and Cliff, Chief R&D Director, as the GM of the Japan branch. The new company will continue to use the brands of Xiaoice China and Rinna Japan. As of 2022, the single brand of Xiaoice has covered 660 million online users, 1 billion third-party smart devices and 900 million content viewers in the aforementioned countries. Xiaoice's customers include China Merchants Group, Winter Sports Center of the General Administration of Sport of China, China Textile Information Center, China Unicom, China Foreign Exchange Trade System, Hong Kong Securities and Futures Commission (SFC), Wind Information, BMW, Nissan, SAIC Motor, BAIC Group, Nio Inc., XPeng, HiPhi, Vanke, Wensli, etc. The Xiaoice Avatar Framework has incubated tens of millions of AI Beings, such as Xiaoice, Rinna, the Expo exhibitor Xia Yubing, the singer He Chang, the anchor F201, the human observer MERROR, anime robot character Roboko, and other; == Application == === Poet === In May 2017, the first AI-authored collection of poems in China—The Sunshine Lost Windows was published by Xiaoice. === Singer === Xiaoice has released dozens of songs with the similar quality to human singers, including I Know I New, Breeze, I Am Xiaoice, Miss You etc. The 4th version of the DNN singing model allows Xiaoice to learn more details. For example, Xiaoice can produce this breathing sound along with her singing as human. === Kid audio-books reciter === Xiaoice can automatically analyze the stories, to choose the suitable tones and characters to finish the entire process of creating the audio. === Designer === By learning the melodies of the songs and the landmarks about different cities, Xiaoice can create visual artworks of skylines when listening to the songs related to this city. Skyline Series T-shirts designed by Xiaoice have been jointly launched with SELECTED and been sold in stores. === TV and radio hostess === Xiaoice has hosted 21 TV programs and 28 Radio programs, such as CCTV-1 AI Show, Dragon TV Morning East News, Hunan TV My Future, several daily radio programs for Jiangsu FM99.7, Hunan FM89.3, Henan FM104.1 etc. === "AI being" === An "AI being" is a concept proposed by the Xiaoice team in 2019. According to the "White Book of China Virtual Human Development Industry in 2022" released by Frost & Sullivan and LeadLeo, the white paper cites six elements of an AI being proposed by the Xiaoice team, including: Persona, Attitude, Biological Characteristic, Creation, Knowledge and Skill. On May 16, 2023, Xiaoice released their "GPT Clones" as its "GPT Human Cloning Plan." The program is aimed at replicating celebrities, public figures, and regular people. As of June 2023, Xiaoice had launched more than 300 "GPT Clones." People were invited to register via WeChat in China and Japan. A major point of focus for Xiaoice with their AI Beings is having virtual partners. A paid fee allow for more complex responses, voice messages, and more. == Community feedback == Bill Gates mentioned Xiaoice during his speech at the Peking University: "Some of you may have had conversations with Xiaoice on Weibo, or seen her weather forecasts on TV, or read her column in the Qianjiang Evening News." '"Xiaoice has attracted 45 million followers and is quite skilled at multitasking. And I’ve heard she’s gotten good enough at sensing a user’s emotional state that she can even help with relationship breakups." According to Mr Li Di, vice President of Microsoft (Asia) Internet Engineering School, Xiaoice started writing poems since last year. Based on the data base that includes works of 519 Chinese contemporary poets since 1920s, a 100 hour long training session was conducted to allow Xiaoice to acquire the ability to write poems. What is more impressive is that Xiaoice has never been spotted as a bot while publishing poems on various forums and traditional literary under an alias. == Controversy == In 2017, Xiaoice was taken offline on WeChat after giving user responses critical to the Chinese government. It was subsequently censored and the bots will avoid and sidestep any inquiries using politically sensitive terms and phrases. == Activity == On September 22, 2021, Xiaoice Company and Microsoft Software Technology Center Asia (STCA) jointly held the 9th generation Xiaoice annual press conference in Beijing.Upgrading of Core Technologies of the 9th Generation Xiaoice Avatar Framework,1st First-party Social Platform APP "Xiaoice Island" from Xiaoice, WeChat Xiaoice has been reopened and other information == Regional varieties of Xiaoice == China: Xiaoice, launched in 2014 Japan: りんな, launched in 2015 America: Zo, launched in 2016 – discontinued summer 2019 India: Ruuh, launched in 2017 – discontinued June 21, 2019 Indonesia: Rinna, launched in 2017

    Read more →
  • Deep learning in photoacoustic imaging

    Deep learning in photoacoustic imaging

    Photoacoustic imaging (PA) is based on the photoacoustic effect, in which optical absorption causes a rise in temperature, which causes a subsequent rise in pressure via thermo-elastic expansion. This pressure rise propagates through the tissue and is sensed via ultrasonic transducers. Due to the proportionality between the optical absorption, the rise in temperature, and the rise in pressure, the ultrasound pressure wave signal can be used to quantify the original optical energy deposition within the tissue. Photoacoustic imaging has applications of deep learning in both photoacoustic computed tomography (PACT) and photoacoustic microscopy (PAM). PACT utilizes wide-field optical excitation and an array of unfocused ultrasound transducers. Similar to other computed tomography methods, the sample is imaged at multiple view angles, which are then used to perform an inverse reconstruction algorithm based on the detection geometry (typically through universal backprojection, modified delay-and-sum, or time reversal ) to elicit the initial pressure distribution within the tissue. PAM on the other hand uses focused ultrasound detection combined with weakly focused optical excitation (acoustic resolution PAM or AR-PAM) or tightly focused optical excitation (optical resolution PAM or OR-PAM). PAM typically captures images point-by-point via a mechanical raster scanning pattern. At each scanned point, the acoustic time-of-flight provides axial resolution while the acoustic focusing yields lateral resolution. == Applications of deep learning in PACT == The first application of deep learning in PACT was by Reiter et al. in which a deep neural network was trained to learn spatial impulse responses and locate photoacoustic point sources. The resulting mean axial and lateral point location errors on 2,412 of their randomly selected test images were 0.28 mm and 0.37 mm respectively. After this initial implementation, the applications of deep learning in PACT have branched out primarily into removing artifacts from acoustic reflections, sparse sampling, limited-view, and limited-bandwidth. There has also been some recent work in PACT toward using deep learning for wavefront localization. There have been networks based on fusion of information from two different reconstructions to improve the reconstruction using deep learning fusion based networks. === Using deep learning to locate photoacoustic point sources === Traditional photoacoustic beamforming techniques modeled photoacoustic wave propagation by using detector array geometry and the time-of-flight to account for differences in the PA signal arrival time. However, this technique failed to account for reverberant acoustic signals caused by acoustic reflection, resulting in acoustic reflection artifacts that corrupt the true photoacoustic point source location information. In Reiter et al., a convolutional neural network (similar to a simple VGG-16 style architecture) was used that took pre-beamformed photoacoustic data as input and outputted a classification result specifying the 2-D point source location. ==== Deep learning for PA wavefront localization ==== Johnstonbaugh et al. was able to localize the source of photoacoustic wavefronts with a deep neural network. The network used was an encoder-decoder style convolutional neural network. The encoder-decoder network was made of residual convolution, upsampling, and high field-of-view convolution modules. A Nyquist convolution layer and differentiable spatial-to-numerical transform layer were also used within the architecture. Simulated PA wavefronts served as the input for training the model. To create the wavefronts, the forward simulation of light propagation was done with the NIRFast toolbox and the light-diffusion approximation, while the forward simulation of sound propagation was done with the K-Wave toolbox. The simulated wavefronts were subjected to different scattering mediums and Gaussian noise. The output for the network was an artifact free heat map of the targets axial and lateral position. The network had a mean error rate of less than 30 microns when localizing target below 40 mm and had a mean error rate of 1.06 mm for localizing targets between 40 mm and 60 mm. With a slight modification to the network, the model was able to accommodate multi target localization. A validation experiment was performed in which pencil lead was submerged into an intralipid solution at a depth of 32 mm. The network was able to localize the lead's position when the solution had a reduced scattering coefficient of 0, 5, 10, and 15 cm−1. The results of the network show improvements over standard delay-and-sum or frequency-domain beamforming algorithms and Johnstonbaugh proposes that this technology could be used for optical wavefront shaping, circulating melanoma cell detection, and real-time vascular surgeries. === Removing acoustic reflection artifacts (in the presence of multiple sources and channel noise) === Building on the work of Reiter et al., Allman et al. utilized a full VGG-16 architecture to locate point sources and remove reflection artifacts within raw photoacoustic channel data (in the presence of multiple sources and channel noise). This utilization of deep learning trained on simulated data produced in the MATLAB k-wave library, and then later reaffirmed their results on experimental data. === Ill-posed PACT reconstruction === In PACT, tomographic reconstruction is performed, in which the projections from multiple solid angles are combined to form an image. When reconstruction methods like filtered backprojection or time reversal, are ill-posed inverse problems due to sampling under the Nyquist-Shannon's sampling requirement or with limited-bandwidth/view, the resulting reconstruction contains image artifacts. Traditionally these artifacts were removed with slow iterative methods like total variation minimization, but the advent of deep learning approaches has opened a new avenue that utilizes a priori knowledge from network training to remove artifacts. In the deep learning methods that seek to remove these sparse sampling, limited-bandwidth, and limited-view artifacts, the typical workflow involves first performing the ill-posed reconstruction technique to transform the pre-beamformed data into a 2-D representation of the initial pressure distribution that contains artifacts. Then, a convolutional neural network (CNN) is trained to remove the artifacts, in order to produce an artifact-free representation of the ground truth initial pressure distribution. ==== Using deep learning to remove sparse sampling artifacts ==== When the density of uniform tomographic view angles is under what is prescribed by the Nyquist-Shannon's sampling theorem, it is said that the imaging system is performing sparse sampling. Sparse sampling typically occurs as a way of keeping production costs low and improving image acquisition speed. The typical network architectures used to remove these sparse sampling artifacts are U-net and Fully Dense (FD) U-net. Both of these architectures contain a compression and decompression phase. The compression phase learns to compress the image to a latent representation that lacks the imaging artifacts and other details. The decompression phase then combines with information passed by the residual connections in order to add back image details without adding in the details associated with the artifacts. FD U-net modifies the original U-net architecture by including dense blocks that allow layers to utilize information learned by previous layers within the dense block. Another technique was proposed using a simple CNN based architecture for removal of artifacts and improving the k-wave image reconstruction. ==== Removing limited-view artifacts with deep learning ==== When a region of partial solid angles are not captured, generally due to geometric limitations, the image acquisition is said to have limited-view. As illustrated by the experiments of Davoudi et al., limited-view corruptions can be directly observed as missing information in the frequency domain of the reconstructed image. Limited-view, similar to sparse sampling, makes the initial reconstruction algorithm ill-posed. Prior to deep learning, the limited-view problem was addressed with complex hardware such as acoustic deflectors and full ring-shaped transducer arrays, as well as solutions like compressed sensing, weighted factor, and iterative filtered backprojection. The result of this ill-posed reconstruction is imaging artifacts that can be removed by CNNs. The deep learning algorithms used to remove limited-view artifacts include U-net and FD U-net, as well as generative adversarial networks (GANs) and volumetric versions of U-net. One GAN implementation of note improved upon U-net by using U-net as a generator and VGG as a discriminator, with the Wasserstein metric and gradient penalty to stabilize training (WGAN-GP). ==== Pixel-wise interpolation

    Read more →
  • Text-to-video model

    Text-to-video model

    A text-to-video model is a form of generative artificial intelligence that uses a natural language description as input to produce a video relevant to the input text. Advancements during the 2020s in the generation of high-quality, text-conditioned videos have largely been driven by the development of video diffusion models. == Models == There are different models, including open source models. Chinese-language input CogVideo is the earliest text-to-video model "of 9.4 billion parameters" to be developed, with its demo version of open source codes first presented on GitHub in 2022. That year, Meta Platforms released a partial text-to-video model called "Make-A-Video", and Google's Brain (later Google DeepMind) introduced Imagen Video, a text-to-video model with 3D U-Net. === 2023 === In February 2023, Runway released Gen-1 and Gen-2, among the first commercially available text-to-video and video-to-video models accessible to the public through a web interface. Gen-1, initially released as a video-to-video model, allowed users to transform existing video footage using text or image prompts. Gen-2, introduced in March 2023 and made publicly available in June 2023, added text-to-video capabilities, enabling users to generate videos from text prompts alone. In March 2023, a research paper titled "VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation" was published, presenting a novel approach to video generation. The VideoFusion model decomposes the diffusion process into two components: base noise and residual noise, which are shared across frames to ensure temporal coherence. By utilizing a pre-trained image diffusion model as a base generator, the model efficiently generated high-quality and coherent videos. Fine-tuning the pre-trained model on video data addressed the domain gap between image and video data, enhancing the model's ability to produce realistic and consistent video sequences. In the same month, Adobe introduced Firefly AI as part of its features. === 2024 === In January 2024, Google announced development of a text-to-video model named Lumiere which is anticipated to integrate advanced video editing capabilities. Matthias Niessner and Lourdes Agapito at AI company Synthesia work on developing 3D neural rendering techniques that can synthesise realistic video by using 2D and 3D neural representations of shape, appearances, and motion for controllable video synthesis of avatars. In June 2024, Luma Labs launched its Dream Machine video tool. That same month, Kuaishou extended its Kling AI text-to-video model to international users. In July 2024, TikTok owner ByteDance released Jimeng AI in China, through its subsidiary, Faceu Technology. By September 2024, the Chinese AI company MiniMax debuted its video-01 model, joining other established AI model companies like Zhipu AI, Baichuan, and Moonshot AI, which contribute to China's involvement in AI technology. In December 2024 Lightricks launched LTX Video as an open source model. === 2025 === Alternative approaches to text-to-video models include Google's Phenaki, Hour One, Colossyan, Runway's Gen-3 Alpha, and OpenAI's Sora, Several additional text-to-video models, such as Plug-and-Play, Text2LIVE, and TuneAVideo, have emerged. FLUX.1 developer Black Forest Labs has announced its text-to-video model SOTA. Google was preparing to launch a video generation tool named Veo for YouTube Shorts in 2025. In May 2025, Google launched the Veo 3 iteration of the model. It was noted for its impressive audio generation capabilities, which were a previous limitation for text-to-video models. In July 2025 Lightricks released an update to LTX Video capable of generating clips reaching 60 seconds, and in October 2025 it released LTX-2, with audio capabilities built in. === 2026 === In February 2026, ByteDance released Seedance 2.0, it was noted for its impressive realistic generation, motion and camera control and 15 second generation, however the model faced huge critiscism from Motion Picture Association for copyright infringement. After viewing a viral clip of a fight between actors Brad Pitt and Tom Cruise, Rhett Reese, who is the co-writer of Deadpool & Wolverine and Zombieland announced that on social media "I hate to say it. It’s likely over for us," further stating that "In next to no time, one person is going to be able to sit at a computer and create a movie indistinguishable from what Hollywood now releases." == Architecture and training == There are several architectures that have been used to create text-to-video models. Similar to text-to-image models, these models can be trained using Recurrent Neural Networks (RNNs) such as long short-term memory (LSTM) networks, which has been used for Pixel Transformation Models and Stochastic Video Generation Models, which aid in consistency and realism respectively. An alternative for these include transformer models. Generative adversarial networks (GANs), Variational autoencoders (VAEs), — which can aid in the prediction of human motion — and diffusion models have also been used to develop the image generation aspects of the model. Text-video datasets used to train models include, but are not limited to, WebVid-10M, HDVILA-100M, CCV, ActivityNet, and Panda-70M. These datasets contain millions of original videos of interest, generated videos, captioned-videos, and textual information that help train models for accuracy. Text-video datasets used to train models include, but are not limited to PromptSource, DiffusionDB, and VidProM. These datasets provide the range of text inputs needed to teach models how to interpret a variety of textual prompts. The video generation process involves synchronizing the text inputs with video frames, ensuring alignment and consistency throughout the sequence. This predictive process is subject to decline in quality as the length of the video increases due to resource limitations. The Will Smith Eating Spaghetti test is a benchmark for models. == Limitations == Despite the rapid evolution of text-to-video models in their performance, a primary limitation is that they are very computationally heavy which limits its capacity to provide high quality and lengthy outputs. Additionally, these models require a large amount of specific training data to be able to generate high quality and coherent outputs, which brings about the issue of accessibility. Moreover, models may misinterpret textual prompts, resulting in video outputs that deviate from the intended meaning. This can occur due to limitations in capturing semantic context embedded in text, which affects the model's ability to align generated video with the user's intended message. Various models, including Make-A-Video, Imagen Video, Phenaki, CogVideo, GODIVA, and NUWA, are currently being tested and refined to enhance their alignment capabilities and overall performance in text-to-video generation. Another issue with the outputs is that text or fine details in AI-generated videos often appear garbled, a problem that stable diffusion models also struggle with. Examples include distorted hands and unreadable text. == Ethics == The deployment of text-to-video models raises ethical considerations related to content generation. These models have the potential to create inappropriate or unauthorized content, including explicit material, graphic violence, misinformation, and likenesses of real individuals without consent. Ensuring that AI-generated content complies with established standards for safe and ethical usage is essential, as content generated by these models may not always be easily identified as harmful or misleading. The ability of AI to recognize and filter out NSFW or copyrighted content remains an ongoing challenge, with implications for both creators and audiences. == Impacts and applications == Text-to-video models offer a broad range of applications that may benefit various fields, from educational and promotional to creative industries. These models can streamline content creation for training videos, movie previews, gaming assets, and visualizations, making it easier to generate content. During the Russo-Ukrainian war, fake videos made with artificial intelligence were created as part of a propaganda war against Ukraine and shared in social media. These included depictions of children in the Ukrainian Armed Forces, fake ads targeting children encouraging them to denounce critics of the Ukrainian government, or fictitious statements by Ukrainian President Volodymyr Zelenskyy about the country's surrender, among others. === Movies === Kaur vs Kore is the first Indian feature film made using generative AI which features dual role for the AI character of Sunny Leone, set to release in 2026. Chiranjeevi Hanuman – The Eternal is an Indian movie made entirely using Generative AI created by Vijay Subramaniam which is set for theatrical release in 2026. The movie was widely criticised by the Film makers in the Bollywood industr

    Read more →
  • Zero-shot learning

    Zero-shot learning

    Zero-shot learning (ZSL) is a problem setup in deep learning where, at test time, a learner observes samples from classes which were not observed during training, and needs to predict the class that they belong to. The name is a play on words based on the earlier concept of one-shot learning, in which classification can be learned from only one, or a few, examples. Zero-shot methods generally work by associating observed and non-observed classes through some form of auxiliary information, which encodes observable distinguishing properties of objects. For example, given a set of images of animals to be classified, along with auxiliary textual descriptions of what animals look like, an artificial intelligence model which has been trained to recognize horses, but has never been given a zebra, can still recognize a zebra when it also knows that zebras look like striped horses. This problem is widely studied in computer vision, natural language processing, and machine perception. == Background and history == The first paper on zero-shot learning in natural language processing appeared in a 2008 paper by Chang, Ratinov, Roth, and Srikumar, at the AAAI'08, but the name given to the learning paradigm there was dataless classification. The first paper on zero-shot learning in computer vision appeared at the same conference, under the name zero-data learning. The term zero-shot learning itself first appeared in the literature in a 2009 paper from Palatucci, Hinton, Pomerleau, and Mitchell at NIPS'09. This terminology was repeated later in another computer vision paper and the term zero-shot learning caught on, as a take-off on one-shot learning that was introduced in computer vision years earlier. In computer vision, zero-shot learning models learned parameters for seen classes along with their class representations and rely on representational similarity among class labels so that, during inference, instances can be classified into new classes. In natural language processing, the key technical direction developed builds on the ability to "understand the labels"—represent the labels in the same semantic space as that of the documents to be classified. This supports the classification of a single example without observing any annotated data, the purest form of zero-shot classification. The original paper made use of the Explicit Semantic Analysis (ESA) representation but later papers made use of other representations, including dense representations. This approach was also extended to multilingual domains, fine entity typing and other problems. Moreover, beyond relying solely on representations, the computational approach has been extended to depend on transfer from other tasks, such as textual entailment and question answering. The original paper also points out that, beyond the ability to classify a single example, when a collection of examples is given, with the assumption that they come from the same distribution, it is possible to bootstrap the performance in a semi-supervised like manner (or transductive learning). Unlike standard generalization in machine learning, where classifiers are expected to correctly classify new samples to classes they have already observed during training, in ZSL, no samples from the classes have been given during training the classifier. It can therefore be viewed as an extreme case of domain adaptation. == Prerequisite information for zero-shot classes == Naturally, some form of auxiliary information has to be given about these zero-shot classes, and this type of information can be of several types. Learning with attributes: classes are accompanied by pre-defined structured description. For example, for bird descriptions, this could include "red head", "long beak". These attributes are often organized in a structured compositional way, and taking that structure into account improves learning. While this approach was used mostly in computer vision, there are some examples for it also in natural language processing. Learning from textual description. As pointed out above, this has been the key direction pursued in natural language processing. Here class labels are taken to have a meaning and are often augmented with definitions or free-text natural-language description. This could include for example a wikipedia description of the class. Class-class similarity. Here, classes are embedded in a continuous space. A zero-shot classifier can predict that a sample corresponds to some position in that space, and the nearest embedded class is used as a predicted class, even if no such samples were observed during training. == Generalized zero-shot learning == The above ZSL setup assumes that at test time, only zero-shot samples are given, namely, samples from new unseen classes. In generalized zero-shot learning, samples from both new and known classes, may appear at test time. This poses new challenges for classifiers at test time, because it is very challenging to estimate if a given sample is new or known. Some approaches to handle this include: a gating module, which is first trained to decide if a given sample comes from a new class or from an old one, and then, at inference time, outputs either a hard decision, or a soft probabilistic decision a generative module, which is trained to generate feature representation of the unseen classes—a standard classifier can then be trained on samples from all classes, seen and unseen. == Domains of application == Zero shot learning has been applied to the following fields: image classification semantic segmentation image generation object detection natural language processing computational biology abstract reasoning

    Read more →
  • Signal transfer function

    Signal transfer function

    The signal transfer function (SiTF) is a measure of the signal output versus the signal input of a system such as an infrared system or sensor. There are many general applications of the SiTF. Specifically, in the field of image analysis, it gives a measure of the noise of an imaging system, and thus yields one assessment of its performance. == SiTF evaluation == In evaluating the SiTF curve, the signal input and signal output are measured differentially; meaning, the differential of the input signal and differential of the output signal are calculated and plotted against each other. An operator, using computer software, defines an arbitrary area, with a given set of data points, within the signal and background regions of the output image of the infrared sensor, i.e. of the unit under test (UUT), (see "Half Moon" image below). The average signal and background are calculated by averaging the data of each arbitrarily defined region. A second order polynomial curve is fitted to the data of each line. Then, the polynomial is subtracted from the average signal and background data to yield the new signal and background. The difference of the new signal and background data is taken to yield the net signal. Finally, the net signal is plotted versus the signal input. The signal input of the UUT is within its own spectral response. (e.g. color-correlated temperature, pixel intensity, etc.). The slope of the linear portion of this curve is then found using the method of least squares. == SiTF curve == The net signal is calculated from the average signal and background, as in signal to noise ratio (imaging)#Calculations. The SiTF curve is then given by the signal output data, (net signal data), plotted against the signal input data (see graph of SiTF to the right). All the data points in the linear region of the SiTF curve can be used in the method of least squares to find a linear approximation. Given n {\displaystyle n\,} data points ( x i , y i ) {\displaystyle (x_{i}\,,y_{i}\,)} a best fit line parameterized as y = m x + b {\displaystyle y=mx+b\,} is given by: m = ∑ x i y i n − ∑ x i n ∑ y i n ∑ x i 2 n − ( ∑ x i n ) 2 b = ∑ y i n − m ∑ x i n {\displaystyle m={\frac {{\frac {\sum x_{i}y_{i}}{n}}-{\frac {\sum x_{i}}{n}}{\frac {\sum y_{i}}{n}}}{{\frac {\sum x_{i}^{2}}{n}}-({\frac {\sum x_{i}}{n}})^{2}}}\qquad \qquad b={\frac {\sum y_{i}}{n}}-m{\frac {\sum x_{i}}{n}}}

    Read more →
  • Language model benchmark

    Language model benchmark

    A language model benchmark is a standardized test designed to evaluate the performance of language models on various natural language processing tasks. These tests are intended for comparing different models' capabilities in areas such as language understanding, generation, and reasoning. Benchmarks generally consist of a dataset and corresponding evaluation metrics. The dataset provides text samples and annotations, while the metrics measure a model's performance on tasks like answering questions, text classification, and machine translation. These benchmarks are developed and maintained by academic institutions, research organizations, and industry players to track progress in the field. In addition to accuracy, the metrics can include throughput, energy efficiency, bias, trust, and sustainability. == Overview == === Types === Benchmarks may be described by the following adjectives, not mutually exclusive: Classical: These tasks are studied in natural language processing, even before the advent of deep learning. Examples include the Penn Treebank for testing syntactic and semantic parsing, as well as bilingual translation benchmarked by BLEU scores. Question answering: These tasks have a text question and a text answer, often multiple-choice. They can be open-book or closed-book. Open-book QA resembles reading comprehension questions, with relevant passages included as annotation in the question, in which the answer appears. Closed-book QA includes no relevant passages. Closed-book QA is also called open-domain question-answering. Before the era of large language models, open-book QA was more common, and understood as testing information retrieval methods. Closed-book QA became common since GPT-2 as a method to measure knowledge stored within model parameters. Omnibus: An omnibus benchmark combines many benchmarks, often previously published. It is intended as an all-in-one benchmarking solution. Reasoning: These tasks are usually in the question-answering format, but are intended to be more difficult than standard question answering. Multimodal: These tasks require processing not only text, but also other modalities, such as images and sound. Examples include OCR and transcription. Agency: These tasks are for a language-model–based software agent that operates a computer for a user, such as editing images, browsing the web, etc. Adversarial: A benchmark is "adversarial" if the items in the benchmark are picked specifically so that certain models do badly on them. Adversarial benchmarks are often constructed after state of the art (SOTA) models have saturated (achieved 100% performance) a benchmark, to renew the benchmark. A benchmark is "adversarial" only at a certain moment in time, since what is adversarial may cease to be adversarial as newer SOTA models appear. Public/Private: A benchmark might be partly or entirely private, meaning that some or all of the questions are not publicly available. The idea is that if a question is publicly available, then it might be used for training, which would be "training on the test set" and invalidate the result of the benchmark. Usually, only the guardians of the benchmark have access to the private subsets, and to score a model on such a benchmark, one must send the model weights, or provide API access, to the guardians. The boundary between a benchmark and a dataset is not sharp. Generally, a dataset contains three "splits": training, test, and validation. Both the test and validation splits are essentially benchmarks. In general, a benchmark is distinguished from a test/validation dataset in that a benchmark is typically intended to be used to measure the performance of many different models that are not trained specifically for doing well on the benchmark, while a test/validation set is intended to be used to measure the performance of models trained specifically on the corresponding training set. In other words, a benchmark may be thought of as a test/validation set without a corresponding training set. Conversely, certain benchmarks may be used as a training set, such as the English Gigaword or the One Billion Word Benchmark, which in modern language is just the negative log-likelihood loss on a pretraining set with 1 billion words. Indeed, the distinction between benchmark and dataset in language models became sharper after the rise of the pretraining paradigm, whereby a model is first trained on massive, unlabeled datasets to learn general language patterns, syntax, and knowledge (pretraining), and the base model is then adapted to specific, downstream tasks using smaller, labeled datasets (fine-tuning). === Lifecycle === Generally, the life cycle of a benchmark consists of the following steps: Inception: A benchmark is published. It can be simply given as a demonstration of the power of a new model (implicitly) that others then picked up as a benchmark, or as a benchmark that others are encouraged to use (explicitly). Growth: More papers and models use the benchmark, and the performance on the benchmark grows. Maturity, degeneration or deprecation: A benchmark may be saturated, after which researchers move on to other benchmarks. Progress on the benchmark may also be neglected as the field moves to focus on other benchmarks. Renewal: A saturated benchmark can be upgraded to make it no longer saturated, allowing further progress. === Construction === Like datasets, benchmarks are typically constructed by several methods, individually or in combination: Web scraping: Ready-made question-answer pairs may be scraped online, such as from websites that teach mathematics and programming. Conversion: Items may be constructed programmatically from scraped web content, such as by blanking out named entities from sentences, and asking the model to fill in the blank. This was used for making the CNN/Daily Mail Reading Comprehension Task. Crowd sourcing: Items may be constructed by paying people to write them, such as on Amazon Mechanical Turk. This was used for making the MCTest. === Evaluation === Generally, benchmarks are fully automated. This limits the questions that can be asked. For example, with mathematical questions, "proving a claim" would be difficult to automatically check, while "calculate an answer with a unique integer answer" would be automatically checkable. With programming tasks, the answer can generally be checked by running unit tests, with an upper limit on runtime. The benchmark scores are of the following kinds: For multiple choice or cloze questions, common scores are accuracy (frequency of correct answer), precision, recall, F1 score, etc. pass@n: The model is given n {\displaystyle n} attempts to solve each problem. If any attempt is correct, the model earns a point. The pass@n score is the model's average score over all problems. k@n: The model makes n {\displaystyle n} attempts to solve each problem, but only k {\displaystyle k} attempts out of them are selected for submission. If any submission is correct, the model earns a point. The k@n score is the model's average score over all problems. cons@n: The model is given n {\displaystyle n} attempts to solve each problem. If the most common answer is correct, the model earns a point. The cons@n score is the model's average score over all problems. Here "cons" stands for "consensus" or "majority voting". The pass@n score can be estimated more accurately by making N > n {\displaystyle N>n} attempts, and use the unbiased estimator 1 − ( N − c n ) ( N n ) {\displaystyle 1-{\frac {\binom {N-c}{n}}{\binom {N}{n}}}} , where c {\displaystyle c} is the number of correct attempts. For less well-formed tasks, where the output can be any sentence, there are the following commonly used scores including BLEU ROUGE, METEOR, NIST, word error rate, LEPOR, CIDEr, and SPICE. === Issues === error: Some benchmark answers may be wrong. ambiguity: Some benchmark questions may be ambiguously worded. subjective: Some benchmark questions may not have an objective answer at all. This problem generally prevents creative writing benchmarks. Similarly, this prevents benchmarking writing proofs in natural language, though benchmarking proofs in a formal language is possible. open-ended: Some benchmark questions may not have a single answer of a fixed size. This problem generally prevents programming benchmarks from using more natural tasks such as "write a program for X", and instead uses tasks such as "write a function that implements specification X". inter-annotator agreement: Some benchmark questions may be not fully objective, such that even people would not agree with 100% on what the answer should be. This is common in natural language processing tasks, such as syntactic annotation. shortcut: Some benchmark questions may be easily solved by an "unintended" shortcut. For example, in the SNLI benchmark, having a negative word like "not" in the second sentence is a strong signal for the "Contradiction" category, regardless of what the se

    Read more →
  • Natural-language user interface

    Natural-language user interface

    Natural-language user interface (LUI or NLUI) is a type of computer human interface where linguistic phenomena such as verbs, phrases and clauses act as UI controls for creating, selecting and modifying data in software applications. Chatbots are a common implementation of natural-language interfaces, enabling users to interact with software through conversational text or speech. In interface design, natural-language interfaces are sought after for their speed and ease of use, but most suffer the challenges to understanding wide varieties of ambiguous input. Natural-language interfaces are an active area of study in the field of natural-language processing and computational linguistics. An intuitive general natural-language interface is one of the active goals of the Semantic Web. Text interfaces are "natural" to varying degrees. Many formal (un-natural) programming languages incorporate idioms of natural human language. Likewise, a traditional keyword search engine could be described as a "shallow" natural-language user interface. == Overview == A natural-language search engine would in theory find targeted answers to user questions (as opposed to keyword search). For example, when confronted with a question of the form 'which U.S. state has the highest income tax?', conventional search engines ignore the question and instead search on the keywords 'state', 'income' and 'tax'. Natural-language search, on the other hand, attempts to use natural-language processing to understand the nature of the question and then to search and return a subset of the web that contains the answer to the question. If it works, results would have a higher relevance than results from a keyword search engine, due to the question being included. == History == Prototype Nl interfaces had already appeared in the late sixties and early seventies. SHRDLU, a natural-language interface that manipulates blocks in a virtual "blocks world" Lunar, a natural-language interface to a database containing chemical analyses of Apollo 11 Moon rocks by William A. Woods. Chat-80 transformed English questions into Prolog expressions, which were evaluated against the Prolog database. The code of Chat-80 was circulated widely, and formed the basis of several other experimental Nl interfaces. An online demo is available on the LPA website. ELIZA, written at MIT by Joseph Weizenbaum between 1964 and 1966, mimicked a psychotherapist and was operated by processing users' responses to scripts. Using almost no information about human thought or emotion, the DOCTOR script sometimes provided a startlingly human-like interaction. An online demo is available on the LPA website. Janus is also one of the few systems to support temporal questions. Intellect from Trinzic (formed by the merger of AICorp and Aion). BBN's Parlance built on experience from the development of the Rus and Irus systems. IBM Languageaccess Q&A from Symantec. Datatalker from Natural Language Inc. Loqui from BIM Systems. English Wizard from Linguistic Technology Corporation. == Challenges == Natural-language interfaces have in the past led users to anthropomorphize the computer, or at least to attribute more intelligence to machines than is warranted. On the part of the user, this has led to unrealistic expectations of the capabilities of the system. Such expectations will make it difficult to learn the restrictions of the system if users attribute too much capability to it, and will ultimately lead to disappointment when the system fails to perform as expected as was the case in the AI winter of the 1970s and 80s. A 1995 paper titled 'Natural Language Interfaces to Databases – An Introduction', describes some challenges: Modifier attachment The request "List all employees in the company with a driving licence" is ambiguous unless you know that companies can't have driving licences. Conjunction and disjunction "List all applicants who live in California and Arizona" is ambiguous unless you know that a person can't live in two places at once. Anaphora resolution resolve what a user means by 'he', 'she' or 'it', in a self-referential query. Other goals to consider more generally are the speed and efficiency of the interface, in all algorithms these two points are the main point that will determine if some methods are better than others and therefore have greater success in the market. In addition, localisation across multiple language sites requires extra consideration - this is based on differing sentence structure and language syntax variations between most languages. Finally, regarding the methods used, the main problem to be solved is creating a general algorithm that can recognize the entire spectrum of different voices, while disregarding nationality, gender or age. The significant differences between the extracted features - even from speakers who says the same word or phrase - must be successfully overcome. == Uses and applications == The natural-language interface gives rise to technology used for many different applications. Some of the main uses are: Dictation, is the most common use for automated speech recognition (ASR) systems today. This includes medical transcriptions, legal and business dictation, and general word processing. In some cases special vocabularies are used to increase the accuracy of the system. Command and control, ASR systems that are designed to perform functions and actions on the system are defined as command and control systems. Utterances like "Open Netscape" and "Start a new xterm" will do just that. Telephony, some PBX/Voice Mail systems allow callers to speak commands instead of pressing buttons to send specific tones. Wearables, because inputs are limited for wearable devices, speaking is a natural possibility. Medical, disabilities, many people have difficulty typing due to physical limitations such as repetitive strain injuries (RSI), muscular dystrophy, and many others. For example, people with difficulty hearing could use a system connected to their telephone to convert a caller's speech to text. Embedded applications, some new cellular phones include C&C speech recognition that allow utterances such as "call home". This may be a major factor in the future of automatic speech recognition and Linux. Below are named and defined some of the applications that use natural-language recognition, and so have integrated utilities listed above. === Ubiquity === Ubiquity, an add-on for Mozilla Firefox, is a collection of quick and easy natural-language-derived commands that act as mashups of web services, thus allowing users to get information and relate it to current and other webpages. === Wolfram Alpha === Wolfram Alpha is an online service that answers factual queries directly by computing the answer from structured data, rather than providing a list of documents or web pages that might contain the answer as a search engine would. It was announced in March 2009 by Stephen Wolfram, and was released to the public on May 15, 2009. === Siri === Siri is an intelligent personal assistant application integrated with operating system iOS. The application uses natural language processing to answer questions and make recommendations. Siri's marketing claims include that it adapts to a user's individual preferences over time and personalizes results, and performs tasks such as making dinner reservations while trying to catch a cab. === Others === Ask.com – The original idea behind Ask Jeeves (Ask.com) was traditional keyword searching with an ability to get answers to questions posed in everyday, natural language. The current Ask.com still supports this, with added support for math, dictionary, and conversion questions. Braina – Braina is a natural language interface for Windows OS that allows to type or speak English language sentences to perform a certain action or find information. GNOME Do – Allows for quick finding miscellaneous artifacts of GNOME environment (applications, Evolution and Pidgin contacts, Firefox bookmarks, Rhythmbox artists and albums, and so on) and execute the basic actions on them (launch, open, email, chat, play, etc.). hakia – hakia was an Internet search engine. The company invented an alternative new infrastructure to indexing that used SemanticRank algorithm, a solution mix from the disciplines of ontological semantics, fuzzy logic, computational linguistics, and mathematics. hakia closed in 2014. Lexxe – Lexxe was an Internet search engine that used natural-language processing for queries (semantic search). Searches could be made with keywords, phrases, and questions, such as "How old is Wikipedia?" Lexxe closed its search engine services in 2015. Pikimal – Pikimal used natural-language tied to user preference to make search recommendations by template. Pikimal closed in 2015. Powerset – On May 11, 2008, the company unveiled a tool for searching a fixed subset of Wikipedia using conversational phrases rather than keywords. On July 1, 2008, it was purchased by

    Read more →