AI Art News

AI Art News — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Saliency map

    Saliency map

    In computer vision, a saliency map is an image that highlights either the region on which people's eyes focus first or the most relevant regions for machine learning models. The goal of a saliency map is to reflect the degree of importance of a pixel to the human visual system or an otherwise opaque ML model. For example, in this image, a person first looks at the fort and light clouds, so they should be highlighted on the saliency map. == Application == === Overview === Saliency maps have applications in a variety of different problems. Some general applications: ==== Human eye ==== Image and video compression: The human eye focuses only on a small region of interest in the frame. Therefore, it is not necessary to compress the entire frame with uniform quality. According to the authors, using a salience map reduces the final size of the video with the same visual perception. Image and video quality assessment: The main task for an image or video quality metric is a high correlation with user opinions. Differences in salient regions are given more importance and thus contribute more to the quality score. Image retargeting: It aims at resizing an image by expanding or shrinking the noninformative regions. Therefore, retargeting algorithms rely on the availability of saliency maps that accurately estimate all the salient image details. Object detection and recognition: Instead of applying a computationally complex algorithm to the whole image, we can use it to the most salient regions of an image most likely to contain an object. the primary visual cortex (V1) appears to be responsible for the saliency map, according to the V1 Saliency Hypothesis. ==== Explainable artificial intelligence ==== Saliency maps are a prominent tool in explainable artificial intelligence, providing visual explanations of the decision-making process of machine learning models, particularly deep neural networks. These maps highlight the regions in input data that are most influential on the model's output, effectively indicating where the model is "looking" when making a prediction. In image classification tasks, for example, saliency maps can identify pixels or regions that contribute most to a specific class decision. Developed for convolutional neural networks, saliency mapping techniques range from simply taking the gradient of the class score with respect to the input data to more complex algorithms, such as integrated gradients and class activation mapping. In transformer architecture, attention mechanisms led to analogous saliency maps, such as attention maps, attention rollouts, and class-discriminative attention maps. === Saliency as a segmentation problem === Saliency estimation may be viewed as an instance of image segmentation. In computer vision, image segmentation is the process of partitioning a digital image into multiple segments (sets of pixels, also known as superpixels). The goal of segmentation is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze. Image segmentation is typically used to locate objects and boundaries (lines, curves, etc.) in images. More precisely, image segmentation is the process of assigning a label to every pixel in an image such that pixels with the same label share certain characteristics. == Algorithms == === Overview === There are three forms of classic saliency estimation algorithms implemented in OpenCV: Static saliency: Relies on image features and statistics to localize the regions of interest of an image. Motion saliency: Relies on motion in a video, detected by optical flow. Objects that move are considered salient. Objectness: Objectness reflects how likely an image window covers an object. These algorithms generate a set of bounding boxes of where an object may lie in an image. In addition to classic approaches, neural-network-based are also popular. There are examples of neural networks for motion saliency estimation: TASED-Net: It consists of two building blocks. First, the encoder network extracts low-resolution spatiotemporal features, and then the following prediction network decodes the spatially encoded features while aggregating all the temporal information. STRA-Net: It emphasizes two essential issues. First, spatiotemporal features integrated via appearance and optical flow coupling, and then multi-scale saliency learned via attention mechanism. STAViS: It combines spatiotemporal visual and auditory information. This approach employs a single network that learns to localize sound sources and to fuse the two saliencies to obtain a final saliency map. There's a new static saliency in the literature with name visual distortion sensitivity. It is based on the idea that the true edges, i.e. object contours, are more salient than the other complex textured regions. It detects edges in a different way from the classic edge detection algorithms. It uses a fairly small threshold for the gradient magnitudes to consider the mere presence of the gradients. So, it obtains 4 binary maps for vertical, horizontal and two diagonal directions. The morphological closing and opening are applied to the binary images to close the small gaps. To clear the blob-like shapes, it utilizes the distance transform. After all, the connected pixel groups are individual edges (or contours). A threshold of size of connected pixel set is used to determine whether an image block contains a perceivable edge (salient region) or not. === Example implementation === First, we should calculate the distance of each pixel to the rest of pixels in the same frame: S A L S ( I k ) = ∑ i = 1 N | I k − I i | {\displaystyle \mathrm {SALS} (I_{k})=\sum _{i=1}^{N}|I_{k}-I_{i}|} I i {\displaystyle I_{i}} is the value of pixel i {\displaystyle i} , in the range of [0,255]. The following equation is the expanded form of this equation. SALS(Ik) = |Ik - I1| + |Ik - I2| + ... + |Ik - IN| Where N is the total number of pixels in the current frame. Then we can further restructure our formula. We put the value that has same I together. SALS(Ik) = Σ Fn × |Ik - In| Where Fn is the frequency of In. And the value of n belongs to [0,255]. The frequencies is expressed in the form of histogram, and the computational time of histogram is ⁠ O ( N ) {\displaystyle O(N)} ⁠ time complexity. ==== Time complexity ==== This saliency map algorithm has ⁠ O ( N ) {\displaystyle O(N)} ⁠ time complexity. Since the computational time of histogram is ⁠ O ( N ) {\displaystyle O(N)} ⁠ time complexity which N is the number of pixel's number of a frame. Besides, the minus part and multiply part of this equation need 256 times operation. Consequently, the time complexity of this algorithm is ⁠ O ( N + 256 ) {\displaystyle O(N+256)} ⁠ which equals to ⁠ O ( N ) {\displaystyle O(N)} ⁠. ==== Pseudocode ==== All of the following code is pseudo MATLAB code. First, read data from video sequences. After we read data, we do superpixel process to each frame. Spnum1 and Spnum2 represent the pixel number of current frame and previous pixel. Then we calculate the color distance of each pixel, this process we call it contract function. After this two process, we will get a saliency map, and then store all of these maps into a new FileFolder. ==== Difference in algorithms ==== The major difference between function one and two is the difference of contract function. If spnum1 and spnum2 both represent the current frame's pixel number, then this contract function is for the first saliency function. If spnum1 is the current frame's pixel number and spnum2 represent the previous frame's pixel number, then this contract function is for second saliency function. If we use the second contract function which using the pixel of the same frame to get center distance to get a saliency map, then we apply this saliency function to each frame and use current frame's saliency map minus previous frame's saliency map to get a new image which is the new saliency result of the third saliency function. == Datasets == The saliency dataset usually contains human eye movements on some image sequences. It is valuable for new saliency algorithm creation or benchmarking the existing one. The most valuable dataset parameters are spatial resolution, size, and eye-tracking equipment. Here is part of the large datasets table from MIT/Tübingen Saliency Benchmark datasets, for example. To collect a saliency dataset, image or video sequences and eye-tracking equipment must be prepared, and observers must be invited. Observers must have normal or corrected to normal vision and must be at the same distance from the screen. At the beginning of each recording session, the eye-tracker recalibrates. To do this, the observer fixates their gaze on the screen center. The session is then started, and saliency data are collected by showing sequences and recording eye gazes. The eye-tracking device is a high-speed camera, capable of recording eye movements at least 250 fr

    Read more →
  • Albert One

    Albert One

    Albert One is an artificial intelligence chatbot created by Robby Garner and designed to mimic the way humans make conversations using a multi-faceted approach in natural language programming. == History == In both 1998 and 1999, Albert One won the Loebner Prize Contest, a competition between chatterbots. Some parts of Albert were deployed on the internet beginning in 1995, to gather information about what kinds of things people would say to a chatterbot. Another element of Albert One involved the building of a large database of human statements, and associated replies. This portion of the project was tested at the 1994-1997 Loebner Prize contests. Albert was the first of Robby Garner's multifaceted bots. The Albert One system was composed of several subsystems. Among those were a version of Eliza, the therapist, Elivs, another Eliza-like bot, and several other helper applications working together in a hierarchical arrangement. As a continuation of the stimulus-response library, various other database queries and assertions were tested to arrive at each of Albert's responses. Robby went on to develop networked examples of this kind of hierarchical "glue" at The Turing Hub.

    Read more →
  • Uncertain data

    Uncertain data

    In computer science, uncertain data is data that contains noise that makes it deviate from the correct, intended or original values. In the age of big data, uncertainty or data veracity is one of the defining characteristics of data. Data is constantly growing in volume, variety, velocity and uncertainty (1/veracity). Uncertain data is found in abundance today on the web, in sensor networks, within enterprises both in their structured and unstructured sources. For example, there may be uncertainty regarding the address of a customer in an enterprise dataset, or the temperature readings captured by a sensor due to aging of the sensor. In 2012 IBM called out managing uncertain data at scale in its global technology outlook report that presents a comprehensive analysis looking three to ten years into the future seeking to identify significant, disruptive technologies that will change the world. In order to make confident business decisions based on real-world data, analyses must necessarily account for many different kinds of uncertainty present in very large amounts of data. Analyses based on uncertain data will have an effect on the quality of subsequent decisions, so the degree and types of inaccuracies in this uncertain data cannot be ignored. Uncertain data is found in the area of sensor networks; text where noisy text is found in abundance on social media, web and within enterprises where the structured and unstructured data may be old, outdated, or plain incorrect; in modeling where the mathematical model may only be an approximation of the actual process. When representing such data in a database, an appropriate uncertain database model needs to be selected. == Example data model for uncertain data == One way to represent uncertain data is through probability distributions. Let us take the example of a relational database. There are three main ways to do represent uncertainty as probability distributions in such a database model. In attribute uncertainty, each uncertain attribute in a tuple is subject to its own independent probability distribution. For example, if readings are taken of temperature and wind speed, each would be described by its own probability distribution, as knowing the reading for one measurement would not provide any information about the other. In correlated uncertainty, multiple attributes may be described by a joint probability distribution. For example, if readings are taken of the position of an object, and the x- and y-coordinates stored, the probability of different values may depend on the distance from the recorded coordinates. As distance depends on both coordinates, it may be appropriate to use a joint distribution for these coordinates, as they are not independent. In tuple uncertainty, all the attributes of a tuple are subject to a joint probability distribution. This covers the case of correlated uncertainty, but also includes the case where there is a probability of a tuple not belonging in the relevant relation, which is indicated by all the probabilities not summing to one. For example, assume we have the following tuple from a probabilistic database: Then, the tuple has 10% chance of not existing in the database.

    Read more →
  • Google Mobile Services

    Google Mobile Services

    Google Mobile Services (GMS) is a collection of proprietary applications and application programming interfaces (APIs) services from Google that are typically pre-installed on the majority of Android devices, such as smartphones, tablets, and smart TVs. GMS is not a part of the Android Open Source Project (AOSP), which means an Android manufacturer needs to obtain a license from Google in order to legally pre-install GMS on an Android device. This license is provided by Google without any licensing fees except in the EU. == Core applications == The following are core applications that are part of Google Mobile Services: Google Search Google Chrome YouTube Google Play Google Drive Gmail Google Meet Google Maps Google Photos Google TV YouTube Music === Historically === Google+ Google Hangouts Google Wallet Google Play Magazines Google Play Music Google Play Movies & TV Google Duo == Reception, competitors, and regulators == === FairSearch === Numerous European firms filed a complaint to the European Commission stating that Google had manipulated their power and dominance within the market to push their Services to be used by phone manufacturers. The firms were joined under the name FairSearch, and the main firms included were Microsoft, Expedia, TripAdvisor, Nokia and Oracle. FairSearch's major problem with Google's practices was that they believed Google were forcing phone manufacturers to use their Mobile Services. They claimed Google managed this by asking these manufacturers to sign a contract stating that they must preinstall specific Google Mobile Services, such as Maps, Search and YouTube, in order to get the latest version of Android. Google swiftly responded stating that they "continue to work co-operatively with the European Commission". === Aptoide === The third-party Android app store Aptoide also filed an EU competition complaint against Google once again stating that they are misusing their power within the market. Aptoide alleged that Google was blocking third-party app stores from being on Google Play, as well as blocking Google Chrome from downloading any third-party apps and app stores. As of June 2014, Google had not responded to these allegations. === Abuse of Android dominance === In May 2019, Umar Javeed, Sukarma Thapar, Aaqib Javeed vs. Google LLC & Ors. the Competition Commission of India ordered an antitrust probe against Google for abusing its dominant position with Android to block market rivals. In Prima Facie opinion the commission held that mandatory pre-installation of the entire Google Mobile Services (GMS) suite, under Mobile Application Distribution Agreements (MADA), amounts to the imposition of unfair conditions on the device manufacturers. === EU antitrust ruling === On July 18, 2018, the European Commission fined Google €4.34 billion for breaching EU antitrust rules which resulted in a change of licensing policy for the GMS in the EU. A new paid licensing agreement for smartphones and tablets shipped into the EEA was created. The change is that the GMS is now decoupled from the base Android and will be offered under a separate paid licensing agreement. === Privacy policy === At the same time, Google faced problems with various European data protection agencies, most notably In the United Kingdom and France. The problem they faced was that they had a set of 60 rules merged into one, which allowed Google to "track users more closely". Google once again came out and stated that their new policies still abide by European Union laws. === Android distributions without Google Mobile Services === After surveillance and privacy concerns, several custom android distributions have been implemented, such as GrapheneOS, LineageOS, CalyxOS, iodéOS or /e/OS, and they come either without any GMS installed by default or with microG, that adds a compatibility layer.

    Read more →
  • Scroll (web service)

    Scroll (web service)

    Scroll was a subscription-based web service developed by Scroll Labs Inc., offering ad-free access to websites in exchange for a fee. Scroll was not an ad blocker; instead, it partnered directly with internet publishers who voluntarily removed ads from their sites for Scroll users in exchange for a portion of the subscription fee. In May 2021, Scroll was acquired by Twitter. In October 2021, Scroll sent out an email announcing its integration into Twitter Blue within 30 days. == Functionality == Scroll enabled users to browse websites that partnered with Scroll without encountering online advertising, in exchange for a subscription fee. Unlike ad blocker, which disable advertisements without compensating the publisher, Scroll sent a browser cookie indicating that the user was a subscriber. The Scroll software integrated into the website detected this cookie and served an ad-free version of the site. In exchange for disabling advertisements, partner websites received a portion of the subscription fee. As of January 2020, Scroll retained 30% of the subscription fee, with the remaining 70% distributed among publisher sites. Payments to sites were made individually by users based on their 'engagement and loyalty,' rather than from a single pool of all subscription revenue. Scroll did not grant subscribers access to partner sites behind a paywall; it only removed ads from the site if the user also paid the publication's subscription fee. == History == Scroll was founded in 2016 by former Chartbeat Chief Executive Tony Haile. Scroll raised US$3 million in its first round of funding in 2016, including investments from The New York Times, Uncork Capital, and Axel Springer SE. By October 2018, Scroll had raised US$10 million in funding. In 2018, Scroll signed its first partner websites, which included The Atlantic, Fusion Media Group, Business Insider, Slate, MSNBC, The Philadelphia Inquirer, and Talking Points Memo. In February 2019, Scroll acquired the social media curation app Nuzzel. The same month, Mozilla and Scroll announced a partnership to run a "test pilot" together, but did not go into details. Scroll entered beta testing in 2019 and launched to the general public on January 28, 2020. In March 2020, Mozilla started offering Scroll as part of its "Firefox Better Web" service bundle. In May 2021, Scroll was acquired by Twitter, with the future of Scroll cited as being uncertain. An email to customers announcing the change said, "Later this year, Scroll will become part of a wider Twitter subscription that will expand on and adapt our services and functionality".

    Read more →
  • DBGallery

    DBGallery

    DBGallery, short for Database Gallery, is a cloud-based Software as a Service (SaaS) and on-prem webserver for teams of various sizes. DBGallery enables users to centrally store, manage, catalog, archive, and securely share image, video, and document files. It facilitates version control, detects duplicates, and offers an intuitive and advanced search functionality, making assets easily accessible to all users. It takes advantage of current AI technologies to automatically add significant metadata to images, facilitates custom-trained AI models, and offers bespoke AI features. Additionally, DBGallery provides team management tools, workflow management, an activity audit trail, and other collaborative features that foster a productive environment for both internal and external stakeholders. == History == DBGallery's first public release was December 2007. Since then each year has seen continuous enhancements. 2013 added support for additional non-English languages in its meta-data. 2014 added support for creating custom data fields for tagging and search. In 2015 included the ability to auto-tag images using Reverse Geocoding. 2018 added artificial intelligence (AI) image recognition as a further addition to auto-tagging. March 2020 added complete image collection management via the web (e.g. file and folder drag and drop), a new collection dashboard, custom data layouts, and an improved audit trail. 2021 saw user experience improvements provided by improved styling and performance enhancements. Version 12 was released in October 2021. It added the ability to upload unlimited file sizes and made significant performance improvements for very large collections. June 2022 saw the release of a global duplicate images search. In late 2022, DBGallery began offering significantly reduced cloud storage cost, at a third of its previous prices, which played into its recent high-volume/high-capacity capabilities and its clients' subsequent demand for additional storage. 2023 saw improvements in user and role management, introduced it's mobile app (PWA), and improved custom-trained object detection. Release 14.0 in the spring of 2024 had large sharing improvements and a new find related images feature. Winter 2025's v15 release introduced AI-generated image descriptions, image-to-text, and facial recognition.

    Read more →
  • Vector database

    Vector database

    A vector database, vector store or vector search engine is a database that stores and retrieves embeddings of data in vector space. Vector databases typically implement approximate nearest neighbor algorithms so users can search for records semantically similar to a given input, unlike traditional databases which primarily look up records by exact match. Use-cases for vector databases include similarity search, semantic search, multi-modal search, recommendations engines, object detection, and retrieval-augmented generation (RAG). Vector embeddings are mathematical representations of data in a high-dimensional space. In this space, each dimension corresponds to a feature of the data, with the number of dimensions ranging from a few hundred to tens of thousands, depending on the complexity of the data being represented. Each data item is represented by one vector in this space. Words, phrases, or entire documents, as well as images, audio, and other types of data, can all be vectorized. These feature vectors may be computed from the raw data using machine learning methods such as feature extraction algorithms, word embeddings or deep learning networks. The goal is that semantically similar data items receive feature vectors close to each other. Vector retrieval can be combined with metadata filtering or lexical search to support filtered and hybrid retrieval workflows. == Techniques == Common techniques for similarity search on high-dimensional vectors include: Hierarchical Navigable Small World (HNSW) graphs Locality-sensitive hashing (LSH) and sketching Product quantization (PQ) Inverted files These techniques may also be combined in vector search systems. In recent benchmarks, HNSW-based implementations have been among the best performers. Conferences such as the International Conference on Similarity Search and Applications (SISAP) and the Conference on Neural Information Processing Systems (NeurIPS) have hosted competitions on vector search in large databases. == Applications == Vector databases are used in a wide range of machine learning applications including similarity search, semantic search, multi-modal search, recommendations engines, object detection, and retrieval-augmented generation. === Retrieval-augmented generation === An especially common use-case for vector databases is in retrieval-augmented generation (RAG), a method to improve domain-specific responses of large language models. The retrieval component of a RAG can be any search system, but is most often implemented as a vector database. Text documents describing the domain of interest are collected, and for each document or document section, a feature vector (known as an "embedding") is computed, typically using a deep learning network, and stored in a vector database along with a link to the document. Given a user prompt, the feature vector of the prompt is computed, and the database is queried to retrieve the most relevant documents. These are then automatically added into the context window of the large language model, and the large language model proceeds to create a response to the prompt given this context. == Implementations ==

    Read more →
  • Ericom Connect

    Ericom Connect

    Ericom Connect is a remote access/application publishing solution produced by Ericom Software that provides secure, centrally managed access to physical or hosted desktops and applications running on Microsoft Windows and Linux systems. == Product overview == Ericom Connect is desktop virtualization and application virtualization software that allows users to run applications remotely, without installing them on the local computer or device. The software is noted for its scalability, ease of deployment, and compatibility with any type of infrastructure, cloud or physical. Ericom Connect uses AccessPad (native client for desktops), AccessToGo (native client for mobile), or AccessNow, one of the first HTML5 RDP solutions to support clientless access to Windows desktops and applications from any device with an HTML5-compatible browser, including Macintosh computers, mobile devices, and Google Chromebooks. Other notable features include performance monitoring, built-in real-time analytics & BI, support for two-factor authentication (using RSA SecurID), multi-tenancy and multi-datacenter support via a single unified web interface, and a “Launch Simulation” feature that allows users to visualize and simulate actual step-by-step user processes directly from within the administration console. In addition to scalability, by distributing configurations, logs, etc., across multiple servers there is no single point of failure, as can be the case if all configuration information is stored on one server. == History == Ericom Connect was introduced in 2015. Ericom Connect is a successor to Ericom PowerTerm Web Connect. PowerTerm Web Connect used an architecture similar to what was then current with Citrix and VMWare, relying on a centralized SQL server, a connection broker, image management for different hypervisors, and a variety of clients. Ericom Connect uses a new grid architecture that provides more scalability, reliability, and flexibility than before.

    Read more →
  • VSCO

    VSCO

    VSCO ( ), formerly known as VSCO Cam, is a photography mobile app available for iOS and Android devices. The app was created by Joel Flory and Greg Lutze. The VSCO app allows users to capture photos in the app and edit them, using preset filters and editing tools. == History == Visual Supply Company was founded by Joel Flory and Greg Lutze in California, in 2011. VSCO was launched in 2012. It raised $40 million from investors in May 2014. In 2017, VSCO launched a subscription model. As of 2018, Visual Supply Company has $90 million in funding from investors and over 2 million paying members. In 2019, VSCO acquired Rylo, a video editing startup founded by the original developer of Instagram’s Hyperlapse. Visual Supply Company has locations in Oakland, California, where it is headquartered, and Chicago, Illinois. In December 2020 VSCO acquired AI-powered video editing app Trash. In April 2018, VSCO reached over 30 million users. In September 2023, Eric Wittman was appointed as the new CEO and co-founder Joel Flory became executive chairman. == Usage == Users must register an account to use the app. Photos can be taken or imported from the camera roll, as well as short videos or animated GIFs (known in the app as DSCO; iOS only). The user can edit their photos through various preset filters, or through the "toolkit" feature which allows finer adjustments to fade, clarity, skin tone, tint, sharpness, saturation, contrast, temperature, exposure, and other properties. Users have the option of posting their photos to their profile, where they can also add captions and hashtags. Photos can also be exported back into the camera roll or shared with other social networking services. The users also have an option to edit their own videos from their camera roll with the VSCO yearly membership, but they are not able to post camera roll as VSCO Film X videos to their account on VSCO. JPEG and raw image files can be used. Research on image based social media platforms has found that engagement with posting, editing, and interacting with images can influence users' mood, self esteem, and body satisfaction. Studies also suggest that greater emotional investment in social media content is associated with increased negative psychological outcomes including stress and depressive symptoms. == In popular culture == VSCO's Oakland headquarters was a key filming location for Boots Riley's 2018 film Sorry to Bother You.

    Read more →
  • DryvIQ

    DryvIQ

    DryvIQ is a software application that enables businesses to migrate on-site system files and associated data across storage and content management platforms, as well as create synchronized hybrid storage systems. == History == Before it was DryvIQ, the software SkySync was released in 2013 by Ann Arbor, Michigan based company, Portal Architects, Inc. The company created SkySync, a back-end, administrative application designed to transfer content across storage platforms, after abandoning 18 months of development on a desktop application called SkyBrary in 2011. Between 2014 and 2015, Portal Architects established partnerships with the following companies: Autodesk, Box, Dropbox, Egnyte, EMC, Google, Syncplicity, Huddle, IBM, Microsoft, OpenText, Oracle, Citrix ShareFile, Hightail and Internet2. SkySync (currently DryvIQ) was named a "Cool Vendor in Content Management" by Gartner in 2015. In 2022, SkySync changed its name to DryvIQ, which is now what the company is currently known as. == Overview == DryvIQ is a software application that syncs, migrates or backs up files including their associated properties, metadata, versions, user accounts and permissions across on-premises and Cloud-based storage platforms. The software deploys on a server, virtual machine or within Microsoft Azure, Amazon Web Services or other cloud computing services.

    Read more →
  • Vision transformer

    Vision transformer

    A vision transformer (ViT) is a transformer designed for computer vision. A ViT decomposes an input image into a series of patches (rather than text into tokens), serializes each patch into a vector, and maps it to a smaller dimension with a single matrix multiplication. These vector embeddings are then processed by a transformer encoder as if they were token embeddings. ViTs were designed as alternatives to convolutional neural networks (CNNs) in computer vision applications. They have different inductive biases, training stability, and data efficiency. Compared to CNNs, ViTs are less data efficient, but have higher capacity. Some of the largest modern computer vision models are ViTs, such as one with 22B parameters. Subsequent to its publication, many variants were proposed, with hybrid architectures with both features of ViTs and CNNs. ViTs have found application in image recognition, image segmentation, weather prediction, and autonomous driving. == History == Transformers were introduced in Attention Is All You Need (2017), and have found widespread use in natural language processing. A 2019 paper applied ideas from the Transformer to computer vision. Specifically, they started with a ResNet, a standard convolutional neural network used for computer vision, and replaced all convolutional kernels by the self-attention mechanism found in a Transformer. It resulted in superior performance. However, it is not a Vision Transformer. In 2020, an encoder-only Transformer was adapted for computer vision, yielding the ViT, which reached state of the art in image classification, overcoming the previous dominance of CNN. The masked autoencoder (2022) extended ViT to work with unsupervised training. The vision transformer and the masked autoencoder, in turn, stimulated new developments in convolutional neural networks. Subsequently, there was cross-fertilization between the previous CNN approach and the ViT approach. In 2021, some important variants of the Vision Transformers were proposed. These variants are mainly intended to be more efficient, more accurate or better suited to a specific domain. Two studies improved efficiency and robustness of ViT by adding a CNN as a preprocessor. The Swin Transformer achieved state-of-the-art results on some object detection datasets such as COCO, by using convolution-like sliding windows of attention mechanism, and the pyramid process in classical computer vision. == Overview == The basic architecture, used by the original 2020 paper, is as follows. In summary, it is a BERT-like encoder-only Transformer. The input image is of type R H × W × C {\displaystyle \mathbb {R} ^{H\times W\times C}} , where H , W , C {\displaystyle H,W,C} are height, width, channel (RGB). It is then split into square-shaped patches of type R P × P × C {\displaystyle \mathbb {R} ^{P\times P\times C}} . For each patch, the patch is pushed through a linear operator, to obtain a vector ("patch embedding"). The position of the patch is also transformed into a vector by "position encoding" (the paper tried no embedding, 1D embedding, 2D embedding, and relative embedding: 1D was adopted). The two vectors are added, then pushed through several Transformer encoders. The attention mechanism in a ViT repeatedly transforms representation vectors of image patches, incorporating more and more semantic relations between image patches in an image. This is analogous to how in natural language processing, as representation vectors flow through a transformer, they incorporate more and more semantic relations between words, from syntax to semantics. The above architecture turns an image into a sequence of vector representations. To use these for downstream applications, an additional head needs to be trained to interpret them. For example, to use it for classification, one can add a shallow MLP on top of it that outputs a probability distribution over classes. The original paper uses a linear-GeLU-linear-softmax network. == Variants == === Original ViT === The original ViT was an encoder-only Transformer supervise-trained to predict the image label from the patches of the image. As in the case of BERT, it uses a special token in the input side, and the corresponding output vector is used as the only input of the final output MLP head. The special token is an architectural hack to allow the model to compress all information relevant for predicting the image label into one vector. Transformers found their initial applications in natural language processing tasks, as demonstrated by language models such as BERT and GPT-3. By contrast the typical image processing system uses a convolutional neural network (CNN). Well-known projects include Xception, ResNet, EfficientNet, DenseNet, and Inception. Transformers measure the relationships between pairs of input tokens (words in the case of text strings), termed attention. The cost is quadratic in the number of tokens. For images, the basic unit of analysis is the pixel. However, computing relationships for every pixel pair in a typical image is prohibitive in terms of memory and computation. Instead, ViT computes relationships among pixels in various small sections of the image (e.g., 16x16 pixels), at a drastically reduced cost. The sections (with positional embeddings) are placed in a sequence. The embeddings are learnable vectors. Each section is arranged into a linear sequence and multiplied by the embedding matrix. The result, with the position embedding is fed to the transformer. === Architectural improvements === ==== Pooling ==== After the ViT processes an image, it produces some embedding vectors. These must be converted to a single class probability prediction by some kind of network. In the original ViT and Masked Autoencoder, they used a dummy [CLS] token, in emulation of the BERT language model. The output at [CLS] is the classification token, which is then processed by a LayerNorm-feedforward-softmax module into a probability distribution. Global average pooling (GAP) does not use the dummy token, but simply takes the average of all output tokens as the classification token. It was mentioned in the original ViT as being equally good. Multihead attention pooling (MAP) applies a multiheaded attention block to pooling. Specifically, it takes as input a list of vectors x 1 , x 2 , … , x n {\displaystyle x_{1},x_{2},\dots ,x_{n}} , which might be thought of as the output vectors of a layer of a ViT. The output from MAP is M u l t i h e a d e d A t t e n t i o n ( Q , V , V ) {\displaystyle \mathrm {MultiheadedAttention} (Q,V,V)} , where q {\displaystyle q} is a trainable query vector, and V {\displaystyle V} is the matrix with rows being x 1 , x 2 , … , x n {\displaystyle x_{1},x_{2},\dots ,x_{n}} . This was first proposed in the Set Transformer architecture. Later papers demonstrated that GAP and MAP both perform better than BERT-like pooling. A variant of MAP was proposed as class attention, which applies MAP, then feedforward, then MAP again. Re-attention was proposed to allow training deep ViT. It changes the multiheaded attention module. === Masked Autoencoder === The Masked Autoencoder took inspiration from denoising autoencoders and context encoders. It has two ViTs put end-to-end. The first one ("encoder") takes in image patches with positional encoding, and outputs vectors representing each patch. The second one (called "decoder", even though it is still an encoder-only Transformer) takes in vectors with positional encoding and outputs image patches again. ==== Training ==== During training, input images (224px x 224 px in the original implementation) are split along a designated number of lines on each axis, producing image patches. A certain percentage of patches are selected to be masked out by mask tokens, while all others are retained in the image. The network is tasked with reconstructing the image from the remaining unmasked patches. Mask tokens in the original implementation are learnable vector quantities. A linear projection with positional embeddings is then applied to the vector of unmasked patches. Experiments varying mask ratio on networks trained on the ImageNet-1K dataset found 75% mask ratios achieved high performance on both finetuning and linear-probing of the encoder's latent space. The MAE processes only unmasked patches during training, increasing the efficiency of data processing in the encoder and lowering the memory usage of the transformer. A less computationally-intensive ViT is used for the decoder in the original implementation of the MAE. Masked patches are added back to the output of the encoder block as mask tokens and both are fed into the decoder. A reconstruction loss is computed for the masked patches to assess network performance. ==== Prediction ==== In prediction, the decoder architecture is discarded entirely. The input image is split into patches by the same algorithm as in training, but no patches are masked out. A linear projection wi

    Read more →
  • Phase correlation

    Phase correlation

    Phase correlation is an approach to estimate the relative translative offset between two similar images (digital image correlation) or other data sets. It is commonly used in image registration and relies on a frequency-domain representation of the data, usually calculated by fast Fourier transforms. The term is applied particularly to a subset of cross-correlation techniques that isolate the phase information from the Fourier-space representation of the cross-correlogram. == Example == The following image demonstrates the usage of phase correlation to determine relative translative movement between two images corrupted by independent Gaussian noise. The image was translated by (20,23) pixels. Accordingly, one can clearly see a peak in the phase-correlation representation at approximately (20,23). == Method == Given two input images g a {\displaystyle \ g_{a}} and g b {\displaystyle \ g_{b}} : Apply a window function (e.g., a Hamming window) on both images to reduce edge effects (this may be optional depending on the image characteristics). Then, calculate the discrete 2D Fourier transform of both images. G a = F { g a } , G b = F { g b } {\displaystyle \ \mathbf {G} _{a}={\mathcal {F}}\{g_{a}\},\;\mathbf {G} _{b}={\mathcal {F}}\{g_{b}\}} Calculate the cross-power spectrum by taking the complex conjugate of the second result, multiplying the Fourier transforms together elementwise, and normalizing this product elementwise. R = G a ∘ G b ∗ | G a ∘ G b ∗ | {\displaystyle \ R={\frac {\mathbf {G} _{a}\circ \mathbf {G} _{b}^{}}{|\mathbf {G} _{a}\circ \mathbf {G} _{b}^{}|}}} Where ∘ {\displaystyle \circ } is the Hadamard product (entry-wise product) and the absolute values are taken entry-wise as well. Written out entry-wise for element index ( j , k ) {\displaystyle (j,k)} : R j k = G a , j k ⋅ G b , j k ∗ | G a , j k ⋅ G b , j k ∗ | {\displaystyle \ R_{jk}={\frac {G_{a,jk}\cdot G_{b,jk}^{}}{|G_{a,jk}\cdot G_{b,jk}^{}|}}} Obtain the normalized cross-correlation by applying the inverse Fourier transform. r = F − 1 { R } {\displaystyle \ r={\mathcal {F}}^{-1}\{R\}} Determine the location of the peak in r {\displaystyle \ r} . ( Δ x , Δ y ) = arg ⁡ max ( x , y ) { r } {\displaystyle \ (\Delta x,\Delta y)=\arg \max _{(x,y)}\{r\}} === Subpixel registration === Commonly, interpolation methods are used to estimate the peak location in the cross-correlogram to non-integer values, despite the fact that the data are discrete, and this procedure is often termed 'subpixel registration'. A large variety of subpixel interpolation methods are given in the technical literature. Common peak interpolation methods such as parabolic interpolation have been used, and the OpenCV computer vision package uses a centroid-based method, though these generally have inferior accuracy compared to more sophisticated methods. Because the Fourier representation of the data has already been computed, it is especially convenient to use the Fourier shift theorem with real-valued (sub-integer) shifts for this purpose, which essentially interpolates using the sinusoidal basis functions of the Fourier transform. An especially popular FT-based estimator is given by Foroosh et al. In this method, the subpixel peak location is approximated by a simple formula involving peak pixel value and the values of its nearest neighbors, where r ( 0 , 0 ) {\displaystyle r_{(0,0)}} is the peak value and r ( 1 , 0 ) {\displaystyle r_{(1,0)}} is the nearest neighbor in the x direction (assuming, as in most approaches, that the integer shift has already been found and the comparand images differ only by a subpixel shift). Δ x = r ( 1 , 0 ) r ( 1 , 0 ) ± r ( 0 , 0 ) {\displaystyle \ \Delta x={\frac {r_{(1,0)}}{r_{(1,0)}\pm r_{(0,0)}}}} The Foroosh et al. method is quite fast compared to most methods, though it is not always the most accurate. Some methods shift the peak in Fourier space and apply non-linear optimization to maximize the correlogram peak, but these tend to be very slow since they must apply an inverse Fourier transform or its equivalent in the objective function. It is also possible to infer the peak location from phase characteristics in Fourier space without the inverse transformation, as noted by Stone. These methods usually use a linear least squares (LLS) fit of the phase angles to a planar model. The long latency of the phase angle computation in these methods is a disadvantage, but the speed can sometimes be comparable to the Foroosh et al. method depending on the image size. They often compare favorably in speed to the multiple iterations of extremely slow objective functions in iterative non-linear methods. Since all subpixel shift computation methods are fundamentally interpolative, the performance of a particular method depends on how well the underlying data conform to the assumptions in the interpolator. This fact also may limit the usefulness of high numerical accuracy in an algorithm, since the uncertainty due to interpolation method choice may be larger than any numerical or approximation error in the particular method. Subpixel methods are also particularly sensitive to noise in the images, and the utility of a particular algorithm is distinguished not only by its speed and accuracy but its resilience to the particular types of noise in the application. == Rationale == The method is based on the Fourier shift theorem. Let the two images g a {\displaystyle \ g_{a}} and g b {\displaystyle \ g_{b}} be circularly-shifted versions of each other: g b ( x , y ) = d e f g a ( ( x − Δ x ) mod M , ( y − Δ y ) mod N ) {\displaystyle \ g_{b}(x,y)\ {\stackrel {\mathrm {def} }{=}}\ g_{a}((x-\Delta x){\bmod {M}},(y-\Delta y){\bmod {N}})} (where the images are M × N {\displaystyle \ M\times N} in size). Then, the discrete Fourier transforms of the images will be shifted relatively in phase: G b ( u , v ) = G a ( u , v ) e − 2 π i ( u Δ x M + v Δ y N ) {\displaystyle \mathbf {G} _{b}(u,v)=\mathbf {G} _{a}(u,v)e^{-2\pi i({\frac {u\Delta x}{M}}+{\frac {v\Delta y}{N}})}} One can then calculate the normalized cross-power spectrum to factor out the phase difference: R ( u , v ) = G a G b ∗ | G a G b ∗ | = G a G a ∗ e 2 π i ( u Δ x M + v Δ y N ) | G a G a ∗ e 2 π i ( u Δ x M + v Δ y N ) | = G a G a ∗ e 2 π i ( u Δ x M + v Δ y N ) | G a G a ∗ | = e 2 π i ( u Δ x M + v Δ y N ) {\displaystyle {\begin{aligned}R(u,v)&={\frac {\mathbf {G} _{a}\mathbf {G} _{b}^{}}{|\mathbf {G} _{a}\mathbf {G} _{b}^{}|}}\\&={\frac {\mathbf {G} _{a}\mathbf {G} _{a}^{}e^{2\pi i({\frac {u\Delta x}{M}}+{\frac {v\Delta y}{N}})}}{|\mathbf {G} _{a}\mathbf {G} _{a}^{}e^{2\pi i({\frac {u\Delta x}{M}}+{\frac {v\Delta y}{N}})}|}}\\&={\frac {\mathbf {G} _{a}\mathbf {G} _{a}^{}e^{2\pi i({\frac {u\Delta x}{M}}+{\frac {v\Delta y}{N}})}}{|\mathbf {G} _{a}\mathbf {G} _{a}^{}|}}\\&=e^{2\pi i({\frac {u\Delta x}{M}}+{\frac {v\Delta y}{N}})}\end{aligned}}} since the magnitude of an imaginary exponential always is one, and the phase of G a G a ∗ {\displaystyle \ \mathbf {G} _{a}\mathbf {G} _{a}^{}} always is zero. The inverse Fourier transform of a complex exponential is a Dirac delta function, i.e. a single peak: r ( x , y ) = δ ( x + Δ x , y + Δ y ) {\displaystyle \ r(x,y)=\delta (x+\Delta x,y+\Delta y)} This result could have been obtained by calculating the cross correlation directly. The advantage of this method is that the discrete Fourier transform and its inverse can be performed using the fast Fourier transform, which is much faster than correlation for large images. === Benefits === Unlike many spatial-domain algorithms, the phase correlation method is resilient to noise, occlusions, and other defects typical of medical or satellite images. The method can be extended to determine rotation and scaling differences between two images by first converting the images to log-polar coordinates. Due to properties of the Fourier transform, the rotation and scaling parameters can be determined in a manner invariant to translation. === Limitations === In practice, it is more likely that g b {\displaystyle \ g_{b}} will be a simple linear shift of g a {\displaystyle \ g_{a}} , rather than a circular shift as required by the explanation above. In such cases, r {\displaystyle \ r} will not be a simple delta function, which will reduce the performance of the method. In such cases, a window function (such as a Gaussian or Tukey window) should be employed during the Fourier transform to reduce edge effects, or the images should be zero padded so that the edge effects can be ignored. If the images consist of a flat background, with all detail situated away from the edges, then a linear shift will be equivalent to a circular shift, and the above derivation will hold exactly. The peak can be sharpened by using edge or vector correlation. For periodic images (such as a chessboard or picket fence), phase correlation may yield ambiguous results with several peaks in the resulting output. == Applications == Phase correlation is the preferred m

    Read more →
  • Energy-based model

    Energy-based model

    An energy-based model (EBM), also called Canonical Ensemble Learning (CEL) or Learning via Canonical Ensemble (LCE), is an application of canonical ensemble formulation from statistical physics for learning from data. The approach prominently appears in generative artificial intelligence. EBMs provide a unified framework for many probabilistic and non-probabilistic approaches to such learning, particularly for training graphical and other structured models. An EBM learns the characteristics of a target dataset and generates a similar but larger dataset. EBMs detect the latent variables of a dataset and generate new datasets with a similar distribution. Energy-based generative neural networks is a class of generative models, which aim to learn explicit probability distributions of data in the form of energy-based models, the energy functions of which are parameterized by modern deep neural networks. Boltzmann machines are a special form of energy-based models with a specific parametrization of the energy. == Description == For a given input x {\displaystyle x} , the model describes an energy E θ ( x ) {\displaystyle E_{\theta }(x)} such that the Boltzmann distribution P θ ( x ) = e − β E θ ( x ) Z ( θ ) {\displaystyle P_{\theta }(x)={e^{-\beta E_{\theta }(x)} \over Z(\theta )}} is a probability (density), and typically β = 1 {\displaystyle \beta =1} . Since the normalization constant: Z ( θ ) := ∫ x ∈ X e − β E θ ( x ) d x {\displaystyle Z(\theta ):=\int _{x\in X}e^{-\beta E_{\theta }(x)}dx} (also known as the partition function) depends on all the Boltzmann factors of all possible inputs x {\displaystyle x} , it cannot be easily computed or reliably estimated during training simply using standard maximum likelihood estimation. However, for maximizing the likelihood during training, the gradient of the log-likelihood of a single training example x {\displaystyle x} is given by using the chain rule: ∂ θ log ⁡ ( P θ ( x ) ) = E x ′ ∼ P θ [ ∂ θ E θ ( x ′ ) ] − ∂ θ E θ ( x ) ( ∗ ) {\displaystyle \partial _{\theta }\log \left(P_{\theta }(x)\right)=\mathbb {E} _{x'\sim P_{\theta }}[\partial _{\theta }E_{\theta }(x')]-\partial _{\theta }E_{\theta }(x)\,()} The expectation in the above formula for the gradient can be approximately estimated by drawing samples x ′ {\displaystyle x'} from the distribution P θ {\displaystyle P_{\theta }} using Markov chain Monte Carlo (MCMC). Early energy-based models, such as the 2003 Boltzmann machine by Hinton, estimated this expectation via blocked Gibbs sampling. Newer approaches make use of more efficient Stochastic Gradient Langevin Dynamics (LD), drawing samples using: x 0 ′ ∼ P 0 , x i + 1 ′ = x i ′ − α 2 ∂ E θ ( x i ′ ) ∂ x i ′ + ϵ {\displaystyle x_{0}'\sim P_{0},x_{i+1}'=x_{i}'-{\frac {\alpha }{2}}{\frac {\partial E_{\theta }(x_{i}')}{\partial x_{i}'}}+\epsilon } , where ϵ ∼ N ( 0 , α ) {\displaystyle \epsilon \sim {\mathcal {N}}(0,\alpha )} . A replay buffer of past values x i ′ {\displaystyle x_{i}'} is used with LD to initialize the optimization module. The parameters θ {\displaystyle \theta } of the neural network are therefore trained in a generative manner via MCMC-based maximum likelihood estimation: the learning process follows an "analysis by synthesis" scheme, where within each learning iteration, the algorithm samples the synthesized examples from the current model by a gradient-based MCMC method (e.g., Langevin dynamics or Hybrid Monte Carlo), and then updates the parameters θ {\displaystyle \theta } based on the difference between the training examples and the synthesized ones – see equation ( ∗ ) {\displaystyle ()} . This process can be interpreted as an alternating mode seeking and mode shifting process, and also has an adversarial interpretation. Essentially, the model learns a function E θ {\displaystyle E_{\theta }} that associates low energies to correct values, and higher energies to incorrect values. After training, given a converged energy model E θ {\displaystyle E_{\theta }} , the Metropolis–Hastings algorithm can be used to draw new samples. The acceptance probability is given by: P a c c ( x i → x ∗ ) = min ( 1 , P θ ( x ∗ ) P θ ( x i ) ) . {\displaystyle P_{acc}(x_{i}\to x^{})=\min \left(1,{\frac {P_{\theta }(x^{})}{P_{\theta }(x_{i})}}\right).} == History == The term "energy-based models" was first coined in a 2003 JMLR paper where the authors defined a generalisation of independent components analysis to the overcomplete setting using EBMs. Other early work on EBMs proposed models that represented energy as a composition of latent and observable variables. == Characteristics == EBMs demonstrate useful properties: Simplicity and stability. The EBM is the only object that needs to be designed and trained. Separate networks need not be trained to ensure balance. Adaptive computation time. An EBM can generate sharp, diverse samples or (more quickly) coarse, less diverse samples. Given infinite time, this procedure produces true samples. Flexibility. In Variational Autoencoders (VAE) and flow-based models, the generator learns a map from a continuous space to a (possibly) discontinuous space containing different data modes. EBMs can learn to assign low energies to disjoint regions (multiple modes). Adaptive generation. EBM generators are implicitly defined by the probability distribution, and automatically adapt as the distribution changes (without training), allowing EBMs to address domains where generator training is impractical, as well as minimizing mode collapse and avoiding spurious modes from out-of-distribution samples. Compositionality. Individual models are unnormalized probability distributions, allowing models to be combined through product of experts or other hierarchical techniques. == Experimental results == On image datasets such as CIFAR-10 and ImageNet 32x32, an EBM model generated high-quality images relatively quickly. It supported combining features learned from one type of image for generating other types of images. It was able to generalize using out-of-distribution datasets, outperforming flow-based and autoregressive models. EBM was relatively resistant to adversarial perturbations, behaving better than models explicitly trained against them with training for classification. == Applications == Target applications include natural language processing, robotics and computer vision. The first energy-based generative neural network is the generative ConvNet proposed in 2016 for image patterns, where the neural network is a convolutional neural network. The model has been generalized to various domains to learn distributions of videos, and 3D voxels. They are made more effective in their variants. They have proven useful for data generation (e.g., image synthesis, video synthesis, 3D shape synthesis, etc.), data recovery (e.g., recovering videos with missing pixels or image frames, 3D super-resolution, etc), data reconstruction (e.g., image reconstruction and linear interpolation ). == Alternatives == EBMs compete with techniques such as variational autoencoders (VAEs), generative adversarial networks (GANs) or normalizing flows. == Extensions == === Joint energy-based models === Joint energy-based models (JEM), proposed in 2020 by Grathwohl et al., allow any classifier with softmax output to be interpreted as energy-based model. The key observation is that such a classifier is trained to predict the conditional probability p θ ( y | x ) = e f → θ ( x ) [ y ] ∑ j = 1 K e f → θ ( x ) [ j ] for y = 1 , … , K and f → θ = ( f 1 , … , f K ) ∈ R K , {\displaystyle p_{\theta }(y|x)={\frac {e^{{\vec {f}}_{\theta }(x)[y]}}{\sum _{j=1}^{K}e^{{\vec {f}}_{\theta }(x)[j]}}}\ \ {\text{ for }}y=1,\dotsc ,K{\text{ and }}{\vec {f}}_{\theta }=(f_{1},\dotsc ,f_{K})\in \mathbb {R} ^{K},} where f → θ ( x ) [ y ] {\displaystyle {\vec {f}}_{\theta }(x)[y]} is the y-th index of the logits f → {\displaystyle {\vec {f}}} corresponding to class y. Without any change to the logits it was proposed to reinterpret the logits to describe a joint probability density: p θ ( y , x ) = e f → θ ( x ) [ y ] Z ( θ ) , {\displaystyle p_{\theta }(y,x)={\frac {e^{{\vec {f}}_{\theta }(x)[y]}}{Z(\theta )}},} with unknown partition function Z ( θ ) {\displaystyle Z(\theta )} and energy E θ ( x , y ) = − f θ ( x ) [ y ] {\displaystyle E_{\theta }(x,y)=-f_{\theta }(x)[y]} . By marginalization, we obtain the unnormalized density p θ ( x ) = ∑ y p θ ( y , x ) = ∑ y e f → θ ( x ) [ y ] Z ( θ ) =: e − E θ ( x ) , {\displaystyle p_{\theta }(x)=\sum _{y}p_{\theta }(y,x)=\sum _{y}{\frac {e^{{\vec {f}}_{\theta }(x)[y]}}{Z(\theta )}}=:e^{-E_{\theta }(x)},} therefore, E θ ( x ) = − log ⁡ ( ∑ y e f → θ ( x ) [ y ] Z ( θ ) ) , {\displaystyle E_{\theta }(x)=-\log \left(\sum _{y}{\frac {e^{{\vec {f}}_{\theta }(x)[y]}}{Z(\theta )}}\right),} so that any classifier can be used to define an energy function E θ ( x ) {\displaystyle E_{\theta }(x)} .

    Read more →
  • Scale space implementation

    Scale space implementation

    In the areas of computer vision, image analysis and signal processing, the notion of scale-space representation is used for processing measurement data at multiple scales, and specifically enhance or suppress image features over different ranges of scale (see the article on scale space). A special type of scale-space representation is provided by the Gaussian scale space, where the image data in N dimensions is subjected to smoothing by Gaussian convolution. Most of the theory for Gaussian scale space deals with continuous images, whereas one when implementing this theory will have to face the fact that most measurement data are discrete. Hence, the theoretical problem arises concerning how to discretize the continuous theory while either preserving or well approximating the desirable theoretical properties that lead to the choice of the Gaussian kernel (see the article on scale-space axioms). This article describes basic approaches for this that have been developed in the literature, see also for an in-depth treatment regarding the topic of approximating the Gaussian smoothing operation and the Gaussian derivative computations in scale-space theory, and for a complementary treatment regarding hybrid discretization methods. == Statement of the problem == The Gaussian scale-space representation of an N-dimensional continuous signal, f C ( x 1 , ⋯ , x N , t ) , {\displaystyle f_{C}\left(x_{1},\cdots ,x_{N},t\right),} is obtained by convolving fC with an N-dimensional Gaussian kernel: g N ( x 1 , ⋯ , x N , t ) . {\displaystyle g_{N}\left(x_{1},\cdots ,x_{N},t\right).} In other words: L ( x 1 , ⋯ , x N , t ) = ∫ u 1 = − ∞ ∞ ⋯ ∫ u N = − ∞ ∞ f C ( x 1 − u 1 , ⋯ , x N − u N , t ) ⋅ g N ( u 1 , ⋯ , u N , t ) d u 1 ⋯ d u N . {\displaystyle L\left(x_{1},\cdots ,x_{N},t\right)=\int _{u_{1}=-\infty }^{\infty }\cdots \int _{u_{N}=-\infty }^{\infty }f_{C}\left(x_{1}-u_{1},\cdots ,x_{N}-u_{N},t\right)\cdot g_{N}\left(u_{1},\cdots ,u_{N},t\right)\,du_{1}\cdots du_{N}.} However, for implementation, this definition is impractical, since it is continuous. When applying the scale space concept to a discrete signal fD, different approaches can be taken. This article is a brief summary of some of the most frequently used methods. == Separability == Using the separability property of the Gaussian kernel g N ( x 1 , … , x N , t ) = G ( x 1 , t ) ⋯ G ( x N , t ) {\displaystyle g_{N}\left(x_{1},\dots ,x_{N},t\right)=G\left(x_{1},t\right)\cdots G\left(x_{N},t\right)} the N-dimensional convolution operation can be decomposed into a set of separable smoothing steps with a one-dimensional Gaussian kernel G along each dimension L ( x 1 , ⋯ , x N , t ) = ∫ u 1 = − ∞ ∞ ⋯ ∫ u N = − ∞ ∞ f C ( x 1 − u 1 , ⋯ , x N − u N , t ) G ( u 1 , t ) d u 1 ⋯ G ( u N , t ) d u N , {\displaystyle L(x_{1},\cdots ,x_{N},t)=\int _{u_{1}=-\infty }^{\infty }\cdots \int _{u_{N}=-\infty }^{\infty }f_{C}(x_{1}-u_{1},\cdots ,x_{N}-u_{N},t)G(u_{1},t)\,du_{1}\cdots G(u_{N},t)\,du_{N},} where G ( x , t ) = 1 2 π t e − x 2 2 t {\displaystyle G(x,t)={\frac {1}{\sqrt {2\pi t}}}e^{-{\frac {x^{2}}{2t}}}} and the standard deviation of the Gaussian σ is related to the scale parameter t according to t = σ2. Separability will be assumed in all that follows, even when the kernel is not exactly Gaussian, since separation of the dimensions is the most practical way to implement multidimensional smoothing, especially at larger scales. Therefore, the rest of the article focuses on the one-dimensional case. == The sampled Gaussian kernel == When implementing the one-dimensional smoothing step in practice, the presumably simplest approach is to convolve the discrete signal fD with a sampled Gaussian kernel: L ( x , t ) = ∑ n = − ∞ ∞ f ( x − n ) G ( n , t ) {\displaystyle L(x,t)=\sum _{n=-\infty }^{\infty }f(x-n)\,G(n,t)} where G ( n , t ) = 1 2 π t e − n 2 2 t {\displaystyle G(n,t)={\frac {1}{\sqrt {2\pi t}}}e^{-{\frac {n^{2}}{2t}}}} (with t = σ2) which in turn is truncated at the ends to give a filter with finite impulse response L ( x , t ) = ∑ n = − M M f ( x − n ) G ( n , t ) {\displaystyle L(x,t)=\sum _{n=-M}^{M}f(x-n)\,G(n,t)} for M chosen sufficiently large (see error function) such that 2 ∫ M ∞ G ( u , t ) d u = 2 ∫ M t ∞ G ( v , 1 ) d v < ε . {\displaystyle 2\int _{M}^{\infty }G(u,t)\,du=2\int _{\frac {M}{\sqrt {t}}}^{\infty }G(v,1)\,dv<\varepsilon .} A common choice is to set M to a constant C times the standard deviation of the Gaussian kernel M = C σ + 1 = C t + 1 {\displaystyle M=C\sigma +1=C{\sqrt {t}}+1} where C is often chosen somewhere between 3 and 6. Using the sampled Gaussian kernel can, however, lead to implementation problems, in particular when computing higher-order derivatives at finer scales by applying sampled derivatives of Gaussian kernels. When accuracy and robustness are primary design criteria, alternative implementation approaches should therefore be considered. For small values of ε (10−6 to 10−8) the errors introduced by truncating the Gaussian are usually negligible. For larger values of ε, however, there are many better alternatives to a rectangular window function. For example, for a given number of points, a Hamming window, Blackman window, or Kaiser window will do less damage to the spectral and other properties of the Gaussian than a simple truncation will. Notwithstanding this, since the Gaussian kernel decreases rapidly at the tails, the main recommendation is still to use a sufficiently small value of ε such that the truncation effects are no longer important. == The discrete Gaussian kernel == A more refined approach is to convolve the original signal with the discrete Gaussian kernel T(n, t) L ( x , t ) = ∑ n = − ∞ ∞ f ( x − n ) T ( n , t ) {\displaystyle L(x,t)=\sum _{n=-\infty }^{\infty }f(x-n)\,T(n,t)} where T ( n , t ) = e − t I n ( t ) {\displaystyle T(n,t)=e^{-t}I_{n}(t)} and I n ( t ) {\displaystyle I_{n}(t)} denotes the modified Bessel functions of integer order, n. This is the discrete counterpart of the continuous Gaussian in that it is the solution to the discrete diffusion equation (discrete space, continuous time), just as the continuous Gaussian is the solution to the continuous diffusion equation. This filter can be truncated in the spatial domain as for the sampled Gaussian L ( x , t ) = ∑ n = − M M f ( x − n ) T ( n , t ) {\displaystyle L(x,t)=\sum _{n=-M}^{M}f(x-n)\,T(n,t)} or can be implemented in the Fourier domain using a closed-form expression for its discrete-time Fourier transform: T ^ ( θ , t ) = ∑ n = − ∞ ∞ T ( n , t ) e − i θ n = e t ( cos ⁡ θ − 1 ) . {\displaystyle {\widehat {T}}(\theta ,t)=\sum _{n=-\infty }^{\infty }T(n,t)\,e^{-i\theta n}=e^{t(\cos \theta -1)}.} With this frequency-domain approach, the scale-space properties transfer exactly to the discrete domain, or with excellent approximation using periodic extension and a suitably long discrete Fourier transform to approximate the discrete-time Fourier transform of the signal being smoothed. Moreover, higher-order derivative approximations can be computed in a straightforward manner (and preserving scale-space properties) by applying small support central difference operators to the discrete scale space representation. As with the sampled Gaussian, a plain truncation of the infinite impulse response will in most cases be a sufficient approximation for small values of ε, while for larger values of ε it is better to use either a decomposition of the discrete Gaussian into a cascade of generalized binomial filters or alternatively to construct a finite approximate kernel by multiplying by a window function. If ε has been chosen too large such that effects of the truncation error begin to appear (for example as spurious extrema or spurious responses to higher-order derivative operators), then the options are to decrease the value of ε such that a larger finite kernel is used, with cutoff where the support is very small, or to use a tapered window. == Recursive filters == Since computational efficiency is often important, low-order recursive filters are often used for scale-space smoothing. For example, Young and van Vliet use a third-order recursive filter with one real pole and a pair of complex poles, applied forward and backward to make a sixth-order symmetric approximation to the Gaussian with low computational complexity for any smoothing scale. By relaxing a few of the axioms, Lindeberg concluded that good smoothing filters would be "normalized Pólya frequency sequences", a family of discrete kernels that includes all filters with real poles at 0 < Z < 1 and/or Z > 1, as well as with real zeros at Z < 0. For symmetry, which leads to approximate directional homogeneity, these filters must be further restricted to pairs of poles and zeros that lead to zero-phase filters. To match the transfer function curvature at zero frequency of the discrete Gaussian, which ensures an approximate semi-group property of additive t, two poles at Z = 1 + 2 t − ( 1 + 2 t ) 2 − 1 {\displaystyle

    Read more →
  • Niki.ai

    Niki.ai

    Niki was an artificial intelligence company headquartered in Bangalore, Karnataka. It was founded in May 2015 by IIT Kharagpur graduates Sachin Jaiswal, Keshav Prawasi, Shishir Modi, and Nitin Babel. The Niki android app was launched for a limited beta in June 2015, then released for public during YourStory's TechSparks 2015, and is a Tech30 company. The company raised an undisclosed amount in seed funding from Unilazer Ventures, a Mumbai-based VC firm founded by Ronnie Screwvala, in October 2015. This was followed by another seed funding round by Ratan Tata in May 2016. The company then raised US$2 million in Series A round of funding from SAP.iO, existing investors and some US and German-based investors, among others. Niki.ai shut down in October 2021 as per media reports. Website not working. == Product == The product is an artificial intelligence-powered chatbot which works as an intelligent personal assistant, named Niki. Leveraging natural language processing and machine learning, Niki presents a chat-based natural language user interface to the users where they can interact with Niki in their natural language. Niki understands how users chat in India, deciphers the words, in the context of product/services that they would like to purchase, and comes up with apt recommendations. Initially, it was only available on the Android platform as a mobile app. The company has expanded its operations to the Facebook Messenger and Apple iOS platforms. The company aims to soon be present on more messaging platforms like Slack and WhatsApp. The company currently provides 20+ services to over 2 million consumers, covering a wide spectrum ranging from utility services like mobile recharge, bill payments, travel services like cabs, buses, hotels and entertainment services like movies and events. Services such as flights and healthcare are also planned. == Partnerships == In September 2017, Infosys Finacle joined with Niki.ai to provide chat-based service to banking customers. In August 2017, Niki partnered with LazyPay to enable a 'buy now, pay later' feature for its users.

    Read more →