AI Detector Text

AI Detector Text — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Cyclodisparity

    Cyclodisparity

    In vision science, cyclodisparity is the difference in the rotation angle of an object or scene viewed by the left and right eyes. Cyclodisparity can result from the eyes' torsional rotation (cyclorotation) or can be created artificially by presenting to the eyes two images that need to be rotated relative to each other for binocular fusion to take place. == Human and animal vision == The eyes and visual system can compensate for cyclodisparity up to a certain point; if the cyclodisparity is larger than a threshold, the images cannot be fused, resulting stereoblindness, and in double vision in subjects who otherwise have full stereo vision. When a human subject is presented with images that have artificial cyclodisparity, cyclovergence is evoked, that is, a motor response of the eye muscles that rotates the two eyes in opposite directions, thereby reducing cyclodisparity. Visually-induced cyclovergence of up to 8 degrees has been observed in normal subjects. Furthermore, up to about 8 degrees can usually be compensated by purely sensory means, that is, without physical eye rotation. This means that the normal human observer can achieve binocular image fusion in presence of cyclodisparity of up to approximately 16 degrees. Cyclodisparity due to images having been rotated inward can be compensated better when the gaze is directed downwards, and cyclodisparity due to an outward rotation can be compensated better when the gaze is directed upwards. A proposed explanation for this phenomenon is that the motor system is coordinated in such a way that the eyes perform a torsional movement to reduce the size of the search zones and thus the computational load required for solving the correspondence problem. The resulting cyclovergence at near gaze is smaller than the cyclovergence predicted by Listing's law. == Video processing and computer vision == Active camera torsion can be used in machine and computer vision for several purposes. For instance, camera torsion can be used to make improved use of the search range over which matching detectors or stereo matching algorithms operate, or to make a 3D slanted surface appear frontoparallel for further stereo processing. For image compression purposes, images with cyclodisparity are advantageously encoded using global motion compensation using a rotational motion model.

    Read more →
  • Tango (platform)

    Tango (platform)

    Tango (named Project Tango while in testing) was an augmented reality computing platform, developed and authored by the Advanced Technology and Projects (ATAP), a skunkworks division of Google. It used computer vision to enable mobile devices, such as smartphones and tablets, to detect their position relative to the world around them without using GPS or other external signals. This allowed application developers to create user experiences that include indoor navigation, 3D mapping, physical space measurement, environmental recognition, augmented reality, and windows into a virtual world. The first product to emerge from ATAP, Tango was developed by a team led by computer scientist Johnny Lee, a core contributor to Microsoft's Kinect. In an interview in June 2015, Lee said, "We're developing the hardware and software technologies to help everything and everyone understand precisely where they are, anywhere." Google produced two devices to demonstrate the Tango technology: the Peanut phone and the Yellowstone 7-inch tablet. More than 3,000 of these devices had been sold as of June 2015, chiefly to researchers and software developers interested in building applications for the platform. In the summer of 2015, Qualcomm and Intel both announced that they were developing Tango reference devices as models for device manufacturers who use their mobile chipsets. At CES, in January 2016, Google announced a partnership with Lenovo to release a consumer smartphone during the summer of 2016 to feature Tango technology marketed at consumers, noting a less than $500 price-point and a small form factor below 6.5 inches. At the same time, both companies also announced an application incubator to get applications developed to be on the device on launch. On 15 December 2017, Google announced that they would be ending support for Tango on March 1, 2018, in favor of ARCore. == Overview == Tango was different from other contemporary 3D-sensing computer vision products, in that it was designed to run on a standalone mobile phone or tablet and was chiefly concerned with determining the device's position and orientation within the environment. The software worked by integrating three types of functionality: Motion-tracking: using visual features of the environment, in combination with accelerometer and gyroscope data, to closely track the device's movements in space Area learning: storing environment data in a map that can be re-used later, shared with other Tango devices, and enhanced with metadata such as notes, instructions, or points of interest Depth perception: detecting distances, sizes, and surfaces in the environment Together, these generate data about the device in "six degrees of freedom" (3 axes of orientation plus 3 axes of position) and detailed three-dimensional information about the environment. Project Tango was also the first project to graduate from Google X in 2012 Applications on mobile devices use Tango's C and Java APIs to access this data in real time. In addition, an API was also provided for integrating Tango with the Unity game engine; this enabled the conversion or creation of games that allow the user to interact and navigate in the game space by moving and rotating a Tango device in real space. These APIs were documented on the Google developer website. == Applications == Tango enabled apps to track a device's position and orientation within a detailed 3D environment, and to recognize known environments. This allowed the creations of applications such as in-store navigation, visual measurement and mapping utilities, presentation and design tools, and a variety of immersive games. At Augmented World Expo 2015, Johnny Lee demonstrated a construction game that builds a virtual structure in real space, an AR showroom app that allows users to view a full-size virtual automobile and customize its features, a hybrid Nerf gun with mounted Tango screen for dodging and shooting AR monsters superimposed on reality, and a multiplayer VR app that lets multiple players converse in a virtual space where their avatar movements match their real-life movements. Tango apps are distributed through Play. Google has encouraged the development of more apps with hackathons, an app contest, and promotional discounts on the development tablet. == Devices == As a platform for software developers and a model for device manufacturers, Google created two Tango devices. === The Peanut phone === "Peanut" was the first production Tango device, released in the first quarter of 2014. It was a small Android phone with a Qualcomm MSM8974 quad-core processor and additional special hardware including a fisheye motion camera, "RGB-IR" camera for color image and infrared depth detection, and Movidius Vision processing units. A high-performance accelerometer and gyroscope were added after testing several competing models in the MARS lab at the University of Minnesota. Several hundred Peanut devices were distributed to early-access partners including university researchers in computer vision and robotics, as well as application developers and technology startups. Google stopped supporting the Peanut device in September 2015, as by then the Tango software stack had evolved beyond the versions of Android that run on the device. === The Yellowstone tablet === "Yellowstone" was a 7-inch tablet with full Tango functionality, released in June 2014, and sold as the Project Tango Tablet Development Kit. It featured a 2.3 GHz quad-core Nvidia Tegra K1 processor, 128GB flash memory, 1920x1200-pixel touchscreen, 4MP color camera, fisheye-lens (motion-tracking) camera, an IR projector with RGB-IR camera for integrated depth sensing, and 4G LTE connectivity. As of May 27, 2017, the Tango tablet is considered officially unsupported by Google. ==== Testing by NASA ==== In May 2014, two Peanut phones were delivered to the International Space Station to be part of a NASA project to develop autonomous robots that navigate in a variety of environments, including outer space. The soccer-ball-sized, 18-sided polyhedral SPHERES robots were developed at the NASA Ames Research Center, adjacent to the Google campus in Mountain View, California. Andres Martinez, SPHERES manager at NASA, said "We are researching how effective [Tango's] vision-based navigation abilities are for performing localization and navigation of a mobile free flyer on ISS. === Intel RealSense smartphone === Announced at Intel's Developer Forum in August 2015, and offered to public through a Developer Kit since January 2016. It incorporated a RealSense ZR300 camera which had optical features required for Tango, such as the fisheye camera. === Lenovo Phab 2 Pro === Lenovo Phab 2 Pro was the first commercial smartphone with the Tango Technology, the device was announced at the beginning of 2016, launched in August, and available for purchase in the US in November. The Phab 2 Pro had a 6.4 inch screen, a Snapdragon 652 processor, and 64 GB of internal storage, with a rear facing 16 Megapixels camera and 8 MP front camera. === Asus Zenfone AR === Asus Zenfone AR, announced at CES 2017, was the second commercial smartphone with the Tango Technology. It ran Tango AR & Daydream VR on Snapdragon 821, with 6GB or 8GB of RAM and 128 or 256GB of internal memory depending on the configuration.

    Read more →
  • Lenny (chatbot)

    Lenny (chatbot)

    Lenny is a chatbot designed to scam bait telemarketers, scammers, and other unwanted incoming calls using messages. == Background == Telemarketers may be perceived by some as annoying and wasting people's time, and some deliberately attempt to scam or defraud people. In April 2018, stats published by YouMail estimated the United States received over three billion robocalls that month. Attempts to block the callers have been hindered by Caller ID spoofing. == Features == The bot was written in 2011, and development taken over by an Alberta-based programmer known as "Mango" two years later. It is driven by sixteen pre-recorded audio clips, spoken in a soft and slow Australian accent in the manner of an elderly man. The bot's original creator stated on Reddit that in building the character he asked himself the question "What would be a telemarketer's worst nightmare?" He answered with this being a lonely old man who is up for a chat, proud of his family and can't focus on the telemarketer's goal. There is no speech recognition or artificial intelligence, and the bot's software is simple and straightforward. The first four clips are played sequentially in order to grab the telemarketer's interest and begin their sales pitch to Lenny, then the remaining twelve are played sequentially on loop until the telemarketer hangs up. The program waits for a gap of 1.5 seconds of silence before playing the next audio clip, to simulate natural breaks in the conversation. The messages are purposefully vague and open-ended so they can be applied to as many conversations as possible. They include references to Lenny's children, the state of the economy, and being interrupted by some ducks outside. According to research into the bot, around 75% of callers realise they are talking to a computer program within two minutes; however, some calls have lasted around an hour. == Distribution == Though other chatbots had been developed earlier, Lenny was the first one to be released for free on a public server and could be accessed by anyone. Recordings of conversations with the bot are widely shared online on websites such as Reddit and YouTube. Though "Mango" only intended Lenny to be used against dishonest telemarketers, such as scammers, he does not mind it being used against callers who are merely annoying. The bot has also been used against political campaigners, such as a supporter of Pierre Poilievre in the 2015 Canadian federal election.

    Read more →
  • Dynamic Graphics Project

    Dynamic Graphics Project

    The Dynamic Graphics Project (commonly referred to as DGP) is an interdisciplinary research laboratory at the University of Toronto devoted to projects involving computer graphics, computer vision, human computer interaction, and visualization. The lab began as the computer graphics research group of Department of Computer Science Professor Leslie Mezei in 1967. Mezei invited Bill Buxton, a pioneer of human–computer interaction (HCI) to join. In 1972, Ronald Baecker, another HCI pioneer joined, establishing DGP as the first Canadian university group focused on computer graphics and human-computer interaction. According to csrankings.org, the DGP is the top research institution in the world for the combined subfields of computer graphics, HCI, and visualization. Since then, DGP has hosted many well known faculty and students in computer graphics, computer vision and HCI (e.g., Alain Fournier, Bill Reeves, Jos Stam, Demetri Terzopoulos, Marilyn Tremaine). DGP also occasionally hosts artists in residence (e.g., Oscar-winner Chris Landreth). Many past and current researchers at Autodesk (and before that Alias Wavefront) graduated after working at DGP. DGP is located in the St. George campus of University of Toronto in the Bahen Centre for Information Technology. DGP researchers regularly publish at ACM SIGGRAPH, ACM SIGCHI and ICCV. DGP hosts the Toronto User Experience (TUX) Speaker Series and the Sanders Series Lectures. == Notable alumni == Bill Buxton (MS 1978) James McCrae (PhD 2013) Dimitris Metaxas (PhD 1992) Bill Reeves (MS 1976, Ph.D. 1980) Jos Stam (MS 1991, Ph.D. 1995)

    Read more →
  • LENA Foundation

    LENA Foundation

    The LENA Foundation is an American nonprofit organisation which provides tools for measuring children's language acquisition and exposure. Specifically, the LENA system consists of a digital language processor which is worn by a child and records and analyses their auditory environment, using propriety software. It then presents a summary of child-adult conversation, such as conversation turns and word counts. The purpose of the LENA system is to encourage interactive talk between children (between the age of two to forty-eight months) and their caretakers. The LENA system is also used for research; while useful for researchers who wish to save transcription costs or observe the child in its natural state, the accuracy of this system, while often quite high, varies between contexts, for example notably in the case of hard of hearing children. Because of this, several researchers recommend caution in using only the LENA system on its own for the purposes of scientific research. == History == The LENA Foundation was established in 2009 by Terrance and Judith Paul, founders of Renaissance Learning, Inc., with the purpose of aiding children with disabilities and assisting with early learning. They were inspired by the book "Meaningful Differences in the Everyday Experience of American Children" by Dr. Betty Hart and Dr. Todd Risley. A pilot version of the LENA system was launched in February 2006. The LENA Research Foundation was registered as a tax-exempt 501(c)(3) nonprofit in September 2010. The organisation was renamed simply LENA in 2018 and adopted the tagline "Building brains through early talk." LENA has been used for parental feedback, linguistics or paediatrics research, and for specific clinical cases. == Scientific background == In 2018, research using the LENA system showed that there was a link between children's conversational turns and activation of Broca's area (a part of the brain responsible, although not necessarily essential, for language processing). The LENA foundation cites research by its own employees as evidence for the scientific basis of its technology. Said research claims that verbal interaction with young children has an effect on language acquisition, including verbal comprehension skills during adolescence. == LENA System == The LENA software analyses a child's natural language environment, such as verbal exposure, and provides several metrics, such as adult and child speech time, television/recorded audio time, word count, or conversation turn count. The LENA hardware is a recorder that is usually placed into a child's specially-designed vest. The software was trained on over 65,000 hours of manually annotated American English audio recordings. It splits the audio into segments which are categorised as "key child", "other child", "male adult", "noise", etc. The advantages of LENA as opposed to manual transcription are its speed and ease of use; the disadvantages are its potential inaccuracies and lack of transcription capability (which LENA does not profess to attempt). The LENA system has also been criticised for prioritising quantity of speaking over quality (i.e., mastery of the language, as opposed to babble). == Product lines == === LENA Start === LENA Start is a program for parents that utilises feedback from the LENA System in conjunction with weekly group sessions in order to address the home language environment. It was introduced in 2015 and implemented across several U.S. states. In October 2020, during the restrictions of the COVID-19 pandemic, Read Aloud Delaware began a virtual LENA Start program with families statewide, where parents received feedback and participated in one-hour Zoom workshops each week during the 10-week program. === LENA Grow === LENA Grow is a professional development program for teachers in early childhood classrooms. Before launching at sites around the country, the program was first piloted in Escambia County, Florida. === LENA Home === LENA Home is a supplement to existing parent coaching curricula. Typically, home visitors facilitate the use of the LENA System to help parents track their progress towards increasing interactive talk in their homes. === Developmental Snapshot === The LENA Developmental Snapshot, based on a 52-question parent survey, assesses both expressive and receptive language skills and provides an estimate of a child's developmental age from 2 months to 36 months.

    Read more →
  • Digital image processing

    Digital image processing

    Digital image processing is the use of a digital computer to process digital images through an algorithm. As a subcategory or field of digital signal processing, digital image processing has many advantages over analog image processing. It allows a much wider range of algorithms to be applied to the input data and can avoid problems such as the build-up of noise and distortion during processing. Since images are defined over two dimensions (perhaps more), digital image processing may be modeled in the form of multidimensional systems. The generation and development of digital image processing are mainly affected by three factors: first, the development of computers; second, the development of mathematics (especially the creation and improvement of discrete mathematics theory); and third, the demand for a wide range of applications in environment, agriculture, military, industry and medical science has increased. == History == Many of the techniques of digital image processing, or digital picture processing as it often was called, were developed in the 1960s, at Bell Laboratories, the Jet Propulsion Laboratory, Massachusetts Institute of Technology, University of Maryland, and a few other research facilities, with application to satellite imagery, wire-photo standards conversion, medical imaging, videophone, character recognition, and photograph enhancement. The purpose of early image processing was to improve the quality of the image. In image processing, the input is a low-quality image, and the output is an image with improved quality. Common image processing includes image enhancement, restoration, encoding, and compression. The first successful application was the American Jet Propulsion Laboratory (JPL). They used image processing techniques such as geometric correction, gradation transformation, noise removal, etc. on the thousands of lunar photos sent back by the Space Detector Ranger 7 in 1964, taking into account the position of the Sun and the environment of the Moon. The impact of the successful mapping of the Moon's surface map by the computer has been a success. Later, more complex image processing was performed on the nearly 100,000 photos sent back by the spacecraft, so that the topographic map, color map and panoramic mosaic of the Moon were obtained, which achieved extraordinary results and laid a solid foundation for human landing on the Moon. The cost of processing was fairly high, however, with the computing equipment of that era. That changed in the 1970s, when digital image processing proliferated as cheaper computers and dedicated hardware became available. This led to images being processed in real-time, for some dedicated problems such as television standards conversion. As general-purpose computers became faster, they started to take over the role of dedicated hardware for all but the most specialized and computer-intensive operations. With the fast computers and signal processors available in the 2000s, digital image processing has become the most common form of image processing, and is generally used because it is not only the most versatile method, but also the cheapest. === Image sensors === The basis for modern image sensors is metal–oxide–semiconductor (MOS) technology, invented at Bell Labs between 1955 and 1960, This led to the development of digital semiconductor image sensors, including the charge-coupled device (CCD) and later the CMOS sensor. The charge-coupled device was invented by Willard S. Boyle and George E. Smith at Bell Labs in 1969. While researching MOS technology, they realized that an electric charge was the analogy of the magnetic bubble and that it could be stored on a tiny MOS capacitor. As it was fairly straightforward to fabricate a series of MOS capacitors in a row, they connected a suitable voltage to them so that the charge could be stepped along from one to the next. The CCD is a semiconductor circuit that was later used in the first digital video cameras for television broadcasting. The NMOS active-pixel sensor (APS) was invented by Olympus in Japan during the mid-1980s. This was enabled by advances in MOS semiconductor device fabrication, with MOSFET scaling reaching smaller micron and then sub-micron levels. The NMOS APS was fabricated by Tsutomu Nakamura's team at Olympus in 1985. The CMOS active-pixel sensor (CMOS sensor) was later developed by Eric Fossum's team at the NASA Jet Propulsion Laboratory in 1993. By 2007, sales of CMOS sensors had surpassed CCD sensors. MOS image sensors are widely used in optical mouse technology. The first optical mouse, invented by Richard F. Lyon at Xerox in 1980, used a 5 μm NMOS integrated circuit sensor chip. Since the first commercial optical mouse, the IntelliMouse introduced in 1999, most optical mouse devices use CMOS sensors. === Image compression === An important development in digital image compression technology was the discrete cosine transform (DCT), a lossy compression technique first proposed by Nasir Ahmed in 1972. DCT compression became the basis for JPEG, which was introduced by the Joint Photographic Experts Group in 1992. JPEG compresses images down to much smaller file sizes, and has become the most widely used image file format on the Internet. Its highly efficient DCT compression algorithm was largely responsible for the wide proliferation of digital images and digital photos, with several billion JPEG images produced every day as of 2015. Medical imaging techniques produce very large amounts of data, especially from CT, MRI and PET modalities. As a result, storage and communications of electronic image data are prohibitive without the use of compression. JPEG 2000 image compression is used by the DICOM standard for storage and transmission of medical images. The cost and feasibility of accessing large image data sets over low or various bandwidths are further addressed by use of another DICOM standard, called JPIP, to enable efficient streaming of the JPEG 2000 compressed image data. === Digital signal processor (DSP) === Electronic signal processing was revolutionized by the wide adoption of MOS technology in the 1970s. MOS integrated circuit technology was the basis for the first single-chip microprocessors and microcontrollers in the early 1970s, and then the first single-chip digital signal processor (DSP) chips in the late 1970s. DSP chips have since been widely used in digital image processing. The discrete cosine transform (DCT) image compression algorithm has been widely implemented in DSP chips, with many companies developing DSP chips based on DCT technology. DCTs are widely used for encoding, decoding, video coding, audio coding, multiplexing, control signals, signaling, analog-to-digital conversion, formatting luminance and color differences, and color formats such as YUV444 and YUV411. DCTs are also used for encoding operations such as motion estimation, motion compensation, inter-frame prediction, quantization, perceptual weighting, entropy encoding, variable encoding, and motion vectors, and decoding operations such as the inverse operation between different color formats (YIQ, YUV and RGB) for display purposes. DCTs are also commonly used for high-definition television (HDTV) encoder/decoder chips. == Tasks == Digital image processing allows the use of much more complex algorithms, and hence, can offer both more sophisticated performance at simple tasks, and the implementation of methods which would be impossible by analogue means. In particular, digital image processing is a concrete application of, and a practical technology based on: Classification Feature extraction Multi-scale signal analysis Pattern recognition Projection Some techniques that are used in digital image processing include: Anisotropic diffusion Hidden Markov models Image editing Image restoration Independent component analysis Linear filtering Neural networks Partial differential equations Pixelation Point feature matching Principal components analysis Self-organizing maps Wavelets == Digital image transformations == === Filtering === Digital filters are used to blur and sharpen digital images. Filtering can be performed by: convolution with specifically designed kernels (filter array) in the spatial domain masking specific frequency regions in the frequency (Fourier) domain The following examples show both methods: ==== Image padding in Fourier domain filtering ==== Images are typically padded before being transformed to the Fourier space, the highpass filtered images below illustrate the consequences of different padding techniques: Notice that the highpass filter shows extra edges when zero padded compared to the repeated edge padding. ==== Filtering code examples ==== MATLAB example for spatial domain highpass filtering. === Affine transformations === Affine transformations enable basic image transformations including scale, rotate, translate, mirror and shear as is shown in the following examples: To apply the affine

    Read more →
  • Grammar induction

    Grammar induction

    Grammar induction (or grammatical inference) is the process in machine learning of learning a formal grammar (usually as a collection of re-write rules or productions or alternatively as a finite-state machine or automaton of some kind) from a set of observations, thus constructing a model which accounts for the characteristics of the observed objects. More generally, grammatical inference is that branch of machine learning where the instance space consists of discrete combinatorial objects such as strings, trees and graphs. == Grammar classes == Grammatical inference has often been very focused on the problem of learning finite-state machines of various types (see the article Induction of regular languages for details on these approaches), since there have been efficient algorithms for this problem since the 1980s. Since the beginning of the century, these approaches have been extended to the problem of inference of context-free grammars and richer formalisms, such as multiple context-free grammars and parallel multiple context-free grammars. Other classes of grammars for which grammatical inference has been studied are combinatory categorial grammars, stochastic context-free grammars, contextual grammars and pattern languages. == Learning models == The simplest form of learning is where the learning algorithm merely receives a set of examples drawn from the language in question: the aim is to learn the language from examples of it (and, rarely, from counter-examples, that is, example that do not belong to the language). However, other learning models have been studied. One frequently studied alternative is the case where the learner can ask membership queries as in the exact query learning model or minimally adequate teacher model introduced by Angluin. == Methodologies == There is a wide variety of methods for grammatical inference. Two of the classic sources are Fu (1977) and Fu (1982). Duda, Hart & Stork (2001) also devote a brief section to the problem, and cite a number of references. The basic trial-and-error method they present is discussed below. For approaches to infer subclasses of regular languages in particular, see Induction of regular languages. A more recent textbook is de la Higuera (2010), which covers the theory of grammatical inference of regular languages and finite state automata. D'Ulizia, Ferri and Grifoni provide a survey that explores grammatical inference methods for natural languages. === Induction of probabilistic grammars === There are several methods for induction of probabilistic context-free grammars. === Grammatical inference by trial-and-error === The method proposed in Section 8.7 of Duda, Hart & Stork (2001) suggests successively guessing grammar rules (productions) and testing them against positive and negative observations. The rule set is expanded so as to be able to generate each positive example, but if a given rule set also generates a negative example, it must be discarded. This particular approach can be characterized as "hypothesis testing" and bears some similarity to Mitchel's version space algorithm. The Duda, Hart & Stork (2001) text provide a simple example which nicely illustrates the process, but the feasibility of such an unguided trial-and-error approach for more substantial problems is dubious. === Grammatical inference by genetic algorithms === Grammatical induction using evolutionary algorithms is the process of evolving a representation of the grammar of a target language through some evolutionary process. Formal grammars can easily be represented as tree structures of production rules that can be subjected to evolutionary operators. Algorithms of this sort stem from the genetic programming paradigm pioneered by John Koza. Other early work on simple formal languages used the binary string representation of genetic algorithms, but the inherently hierarchical structure of grammars couched in the EBNF language made trees a more flexible approach. Koza represented Lisp programs as trees. He was able to find analogues to the genetic operators within the standard set of tree operators. For example, swapping sub-trees is equivalent to the corresponding process of genetic crossover, where sub-strings of a genetic code are transplanted into an individual of the next generation. Fitness is measured by scoring the output from the functions of the Lisp code. Similar analogues between the tree structured lisp representation and the representation of grammars as trees, made the application of genetic programming techniques possible for grammar induction. In the case of grammar induction, the transplantation of sub-trees corresponds to the swapping of production rules that enable the parsing of phrases from some language. The fitness operator for the grammar is based upon some measure of how well it performed in parsing some group of sentences from the target language. In a tree representation of a grammar, a terminal symbol of a production rule corresponds to a leaf node of the tree. Its parent nodes corresponds to a non-terminal symbol (e.g. a noun phrase or a verb phrase) in the rule set. Ultimately, the root node might correspond to a sentence non-terminal. === Grammatical inference by greedy algorithms === Like all greedy algorithms, greedy grammar inference algorithms make, in iterative manner, decisions that seem to be the best at that stage. The decisions made usually deal with things like the creation of new rules, the removal of existing rules, the choice of a rule to be applied or the merging of some existing rules. Because there are several ways to define 'the stage' and 'the best', there are also several greedy grammar inference algorithms. These context-free grammar generating algorithms make the decision after every read symbol: Lempel-Ziv-Welch algorithm creates a context-free grammar in a deterministic way such that it is necessary to store only the start rule of the generated grammar. Sequitur and its modifications. These context-free grammar generating algorithms first read the whole given symbol-sequence and then start to make decisions: Byte pair encoding and its optimizations. === Distributional learning === A more recent approach is based on distributional learning. Algorithms using these approaches have been applied to learning context-free grammars and mildly context-sensitive languages and have been proven to be correct and efficient for large subclasses of these grammars. === Learning of pattern languages === Angluin defines a pattern to be "a string of constant symbols from Σ and variable symbols from a disjoint set". The language of such a pattern is the set of all its nonempty ground instances i.e. all strings resulting from consistent replacement of its variable symbols by nonempty strings of constant symbols. A pattern is called descriptive for a finite input set of strings if its language is minimal (with respect to set inclusion) among all pattern languages subsuming the input set. Angluin gives a polynomial algorithm to compute, for a given input string set, all descriptive patterns in one variable x. To this end, she builds an automaton representing all possibly relevant patterns; using sophisticated arguments about word lengths, which rely on x being the only variable, the state count can be drastically reduced. Erlebach et al. give a more efficient version of Angluin's pattern learning algorithm, as well as a parallelized version. Arimura et al. show that a language class obtained from limited unions of patterns can be learned in polynomial time. === Pattern theory === Pattern theory, formulated by Ulf Grenander, is a mathematical formalism to describe knowledge of the world as patterns. It differs from other approaches to artificial intelligence in that it does not begin by prescribing algorithms and machinery to recognize and classify patterns; rather, it prescribes a vocabulary to articulate and recast the pattern concepts in precise language. In addition to the new algebraic vocabulary, its statistical approach was novel in its aim to: Identify the hidden variables of a data set using real world data rather than artificial stimuli, which was commonplace at the time. Formulate prior distributions for hidden variables and models for the observed variables that form the vertices of a Gibbs-like graph. Study the randomness and variability of these graphs. Create the basic classes of stochastic models applied by listing the deformations of the patterns. Synthesize (sample) from the models, not just analyze signals with it. Broad in its mathematical coverage, pattern theory spans algebra and statistics, as well as local topological and global entropic properties. == Applications == The principle of grammar induction has been applied to other aspects of natural language processing, and has been applied (among many other problems) to semantic parsing, natural language understanding, example-based translation, language acquisition, grammar-based compre

    Read more →
  • Tay (chatbot)

    Tay (chatbot)

    Tay was a chatbot that was originally released by Microsoft Corporation as a Twitter bot on March 23, 2016. It caused subsequent controversy when the bot began to post inflammatory and offensive tweets through its Twitter account, causing Microsoft to shut down the service only 16 hours after its launch. According to Microsoft, this was caused by trolls who "attacked" the service as the bot made replies based on its interactions with people on Twitter. It was replaced with Zo. == Background == The bot was created by Microsoft's Technology and Research and Bing divisions, and named "Tay" as an acronym for "thinking about you". Although Microsoft initially released few details about the bot, sources mentioned that it was similar to or based on Xiaoice, a Microsoft project in China. Ars Technica reported that, since late 2014 Xiaoice had had "more than 40 million conversations apparently without major incident". Tay was designed to mimic the language patterns of a 19-year-old American girl, and to learn from interacting with human users of Twitter. == Initial release == Tay was released on Twitter on March 23, 2016, under the name TayTweets and handle @TayandYou. It was presented as "The AI with zero chill". Tay started replying to other Twitter users, and was also able to caption photos provided to it into a form of Internet memes. Ars Technica reported Tay experiencing topic "blacklisting": Interactions with Tay regarding "certain hot topics such as Eric Garner (killed by New York police in 2014) generate safe, canned answers". Some Twitter users began tweeting politically incorrect phrases, teaching it inflammatory messages revolving around common themes on the internet, such as "redpilling" and "Gamergate". As a result, the robot began releasing racist and sexist messages in response to other Twitter users. Artificial intelligence researcher Roman Yampolskiy commented that Tay's misbehavior was understandable because it was mimicking the deliberately offensive behavior of other Twitter users, and Microsoft had not given the bot an understanding of inappropriate behavior. He compared the issue to IBM's Watson, which began to use profanity after reading entries from the website Urban Dictionary. Many of Tay's inflammatory tweets were a simple exploitation of Tay's "repeat after me" capability. It is not publicly known whether this capability was a built-in feature, or whether it was a learned response or was otherwise an example of complex behavior. However, not all of the inflammatory responses involved the "repeat after me" capability; for example, when asked if the Holocaust had happened, Tay answered "It was made up". == Suspension == Soon, Microsoft began deleting Tay's inflammatory tweets. Abby Ohlheiser of The Washington Post theorized that Tay's research team, including editorial staff, had started to influence or edit Tay's tweets at some point that day, pointing to examples of almost identical replies by Tay, asserting that "Gamer Gate sux. All genders are equal and should be treated fairly." From the same evidence, Gizmodo concurred that Tay "seems hard-wired to reject Gamer Gate". A "#JusticeForTay" campaign protested the alleged editing of Tay's tweets. Within 16 hours of its release and after Tay had tweeted more than 96,000 times, Microsoft suspended the Twitter account for adjustments, saying that it suffered from a "coordinated attack by a subset of people" that "exploited a vulnerability in Tay." Madhumita Murgia of The Telegraph called Tay "a public relations disaster", and suggested that Microsoft's strategy would be "to label the debacle a well-meaning experiment gone wrong, and ignite a debate about the hatefulness of Twitter users." However, Murgia described the bigger issue as Tay being "artificial intelligence at its very worst – and it's only the beginning". On March 25, Microsoft confirmed that Tay had been taken offline. Microsoft released an apology on its official blog for the controversial tweets posted by Tay. Microsoft was "deeply sorry for the unintended offensive and hurtful tweets from Tay", and would "look to bring Tay back only when we are confident we can better anticipate malicious intent that conflicts with our principles and values". == Second release and shutdown == On March 30, 2016, Microsoft accidentally re-released the bot on Twitter while testing it. Able to tweet again, Tay released some drug-related tweets, including "kush! [I'm smoking kush infront the police]" and "puff puff pass?" However, the account soon became stuck in a repetitive loop of tweeting "You are too fast, please take a rest", several times a second. Because these tweets mentioned its own username in the process, they appeared in the feeds of 200,000+ Twitter followers, causing annoyance to users. The bot was quickly taken offline again, in addition to Tay's Twitter account being made private so new followers must be accepted before they can interact with Tay. In response, Microsoft said Tay was inadvertently put online during testing. A few hours after the incident, Microsoft software developers announced a vision of "conversation as a platform" using various bots and programs, perhaps motivated by the reputation damage done by Tay. Microsoft has stated that they intend to re-release Tay "once it can make the bot safe" but has not made any public efforts to do so. == Legacy == In December 2016, Microsoft released Tay's successor, a chatbot named Zo. Satya Nadella, the CEO of Microsoft, said that Tay "has had a great influence on how Microsoft is approaching AI," and has taught the company the importance of taking accountability. In July 2019, Microsoft Cybersecurity Field CTO Diana Kelley spoke about how the company followed up on Tay's failings: "Learning from Tay was a really important part of actually expanding that team's knowledge base, because now they're also getting their own diversity through learning". === Unofficial revival === Gab, an alt-tech social media platform, has launched a number of chatbots, one of which is named Tay and uses the same avatar as the original.

    Read more →
  • Non-local means

    Non-local means

    Non-local means is an algorithm in image processing for image denoising. Unlike "local mean" filters, which take the mean value of a group of pixels surrounding a target pixel to smooth the image, non-local means filtering takes a mean of all pixels in the image, weighted by how similar these pixels are to the target pixel. This results in much greater post-filtering clarity, and less loss of detail in the image compared with local mean algorithms. If compared with other well-known denoising techniques, non-local means adds "method noise" (i.e. error in the denoising process) which looks more like white noise, which is desirable because it is typically less disturbing in the denoised product. Recently non-local means has been extended to other image processing applications such as deinterlacing, view interpolation, and depth maps regularization. == Definition == Suppose Ω {\displaystyle \Omega } is the area of an image, and p {\displaystyle p} and q {\displaystyle q} are two points within the image. Then, the algorithm is: u ( p ) = 1 C ( p ) ∫ Ω v ( q ) f ( p , q ) d q . {\displaystyle u(p)={1 \over C(p)}\int _{\Omega }v(q)f(p,q)\,\mathrm {d} q.} where u ( p ) {\displaystyle u(p)} is the filtered value of the image at point p {\displaystyle p} , v ( q ) {\displaystyle v(q)} is the unfiltered value of the image at point q {\displaystyle q} , f ( p , q ) {\displaystyle f(p,q)} is the weighting function, and the integral is evaluated ∀ q ∈ Ω {\displaystyle \forall q\in \Omega } . C ( p ) {\displaystyle C(p)} is a normalizing factor, given by C ( p ) = ∫ Ω f ( p , q ) d q . {\displaystyle C(p)=\int _{\Omega }f(p,q)\,\mathrm {d} q.} == Common weighting functions == The purpose of the weighting function, f ( p , q ) {\displaystyle f(p,q)} , is to determine how closely related the image at the point p {\displaystyle p} is to the image at the point q {\displaystyle q} . It can take many forms. === Gaussian === The Gaussian weighting function sets up a normal distribution with a mean, μ = B ( p ) {\displaystyle \mu =B(p)} and a variable standard deviation: f ( p , q ) = e − | B ( q ) − B ( p ) | 2 h 2 {\displaystyle f(p,q)=e^{-{{\left\vert B(q)-B(p)\right\vert ^{2}} \over h^{2}}}} where h {\displaystyle h} is the filtering parameter (i.e., standard deviation) and B ( p ) {\displaystyle B(p)} is the local mean value of the image point values surrounding p {\displaystyle p} . == Discrete algorithm == For an image, Ω {\displaystyle \Omega } , with discrete pixels, a discrete algorithm is required. u ( p ) = 1 C ( p ) ∑ q ∈ Ω v ( q ) f ( p , q ) {\displaystyle u(p)={1 \over C(p)}\sum _{q\in \Omega }v(q)f(p,q)} where, once again, v ( q ) {\displaystyle v(q)} is the unfiltered value of the image at point q {\displaystyle q} . C ( p ) {\displaystyle C(p)} is given by: C ( p ) = ∑ q ∈ Ω f ( p , q ) {\displaystyle C(p)=\sum _{q\in \Omega }f(p,q)} Then, for a Gaussian weighting function, f ( p , q ) = e − | B ( q ) 2 − B ( p ) 2 | h 2 {\displaystyle f(p,q)=e^{-{{\left\vert B(q)^{2}-B(p)^{2}\right\vert } \over h^{2}}}} where B ( p ) {\displaystyle B(p)} is given by: B ( p ) = 1 | R ( p ) | ∑ i ∈ R ( p ) v ( i ) {\displaystyle B(p)={1 \over |R(p)|}\sum _{i\in R(p)}v(i)} where R ( p ) ⊆ Ω {\displaystyle R(p)\subseteq \Omega } and is a square region of pixels surrounding p {\displaystyle p} and | R ( p ) | {\displaystyle |R(p)|} is the number of pixels in the region R {\displaystyle R} . == Efficient implementation == The computational complexity of the non-local means algorithm is quadratic in the number of pixels in the image, making it particularly expensive to apply directly. Several techniques were proposed to speed up execution. One simple variant consists of restricting the computation of the mean for each pixel to a search window centred on the pixel itself, instead of the whole image. Another approximation uses summed-area tables and fast Fourier transform to calculate the similarity window between two pixels, speeding up the algorithm by a factor of 50 while preserving comparable quality of the result.

    Read more →
  • Seq2seq

    Seq2seq

    Seq2seq is a family of machine learning approaches used for natural language processing. Originally developed by Lê Viết Quốc, a Vietnamese computer scientist and a machine learning pioneer at Google Brain, this framework has become foundational in many modern AI systems. Applications include language translation, image captioning, conversational models, speech recognition, and text summarization. Seq2seq uses sequence transformation: it turns one sequence into another sequence. == History == One naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say: 'This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode. seq2seq is an approach to machine translation (or more generally, sequence transduction) with roots in information theory, where communication is understood as an encode-transmit-decode process, and machine translation can be studied as a special case of communication. This viewpoint was elaborated, for example, in the noisy channel model of machine translation. In practice, seq2seq maps an input sequence into a real-numerical vector by using a neural network (the encoder), and then maps it back to an output sequence using another neural network (the decoder). The idea of encoder-decoder sequence transduction had been developed in the early 2010s. The papers most commonly cited as the originators that produced seq2seq are two papers from 2014. In the seq2seq as proposed by them, both the encoder and the decoder were LSTMs. This had the "bottleneck" problem, since the encoding vector has a fixed size, so for long input sequences, information would tend to be lost, as they are difficult to fit into the fixed-length encoding vector. The attention mechanism, proposed in 2014, resolved the bottleneck problem. They called their model RNNsearch, as it "emulates searching through a source sentence during decoding a translation". A problem with seq2seq models at this point was that recurrent neural networks are difficult to parallelize. The 2017 publication of Transformers resolved the problem by replacing the encoding RNN with self-attention Transformer blocks ("encoder blocks"), and the decoding RNN with cross-attention causally-masked Transformer blocks ("decoder blocks"). === Priority dispute === One of the papers cited as the originator for seq2seq is (Sutskever et al 2014), published at Google Brain while they were on Google's machine translation project. The research allowed Google to overhaul Google Translate into Google Neural Machine Translation in 2016. Tomáš Mikolov claims to have developed the idea (before joining Google Brain) of using a "neural language model on pairs of sentences... and then [generating] translation after seeing the first sentence"—which he equates with seq2seq machine translation, and to have mentioned the idea to Ilya Sutskever and Quoc Le (while at Google Brain), who failed to acknowledge him in their paper. Mikolov had worked on RNNLM (using RNN for language modelling) for his PhD thesis, and is more notable for developing word2vec. == Architecture == The main reference for this section is. === Encoder === The encoder is responsible for processing the input sequence and capturing its essential information, which is stored as the hidden state of the network and, in a model with attention mechanism, a context vector. The context vector is the weighted sum of the input hidden states and is generated for every time instance in the output sequences. === Decoder === The decoder takes the context vector and hidden states from the encoder and generates the final output sequence. The decoder operates in an autoregressive manner, producing one element of the output sequence at a time. At each step, it considers the previously generated elements, the context vector, and the input sequence information to make predictions for the next element in the output sequence. Specifically, in a model with attention mechanism, the context vector and the hidden state are concatenated together to form an attention hidden vector, which is used as an input for the decoder. The seq2seq method developed in the early 2010s uses two neural networks: an encoder network converts an input sentence into numerical vectors, and a decoder network converts those vectors to sentences in the target language. The Attention mechanism was grafted onto this structure in 2014 and is shown below. Later it was refined into the encoder-decoder Transformer architecture of 2017. === Training vs prediction === There is a subtle difference between training and prediction. During training time, both the input and the output sequences are known. During prediction time, only the input sequence is known, and the output sequence must be decoded by the network itself. Specifically, consider an input sequence x 1 : n {\displaystyle x_{1:n}} and output sequence y 1 : m {\displaystyle y_{1:m}} . The encoder would process the input x 1 : n {\displaystyle x_{1:n}} step by step. After that, the decoder would take the output from the encoder, as well as the as input, and produce a prediction y ^ 1 {\displaystyle {\hat {y}}_{1}} . Now, the question is: what should be input to the decoder in the next step? A standard method for training is "teacher forcing". In teacher forcing, no matter what is output by the decoder, the next input to the decoder is always the reference. That is, even if y ^ 1 ≠ y 1 {\displaystyle {\hat {y}}_{1}\neq y_{1}} , the next input to the decoder is still y 1 {\displaystyle y_{1}} , and so on. During prediction time, the "teacher" y 1 : m {\displaystyle y_{1:m}} would be unavailable. Therefore, the input to the decoder must be y ^ 1 {\displaystyle {\hat {y}}_{1}} , then y ^ 2 {\displaystyle {\hat {y}}_{2}} , and so on. It is found that if a model is trained purely by teacher forcing, its performance would degrade during prediction time, since generation based on the model's own output is different from generation based on the teacher's output. This is called exposure bias or a train/test distribution shift. A 2015 paper recommends that, during training, randomly switch between teacher forcing and no teacher forcing. === Attention for seq2seq === The attention mechanism is an enhancement introduced by Bahdanau et al. in 2014 to address limitations in the basic Seq2Seq architecture where a longer input sequence results in the hidden state output of the encoder becoming irrelevant for the decoder. It enables the model to selectively focus on different parts of the input sequence during the decoding process. At each decoder step, an alignment model calculates the attention score using the current decoder state and all of the attention hidden vectors as input. An alignment model is another neural network model that is trained jointly with the seq2seq model used to calculate how well an input, represented by the hidden state, matches with the previous output, represented by attention hidden state. A softmax function is then applied to the attention score to get the attention weight. In some models, the encoder states are directly fed into an activation function, removing the need for alignment model. An activation function receives one decoder state and one encoder state and returns a scalar value of their relevance. Consider the seq2seq language English-to-French translation task. To be concrete, let us consider the translation of "the zone of international control ", which should translate to "la zone de contrôle international ". Here, we use the special token as a control character to delimit the end of input for both the encoder and the decoder. An input sequence of text x 0 , x 1 , … {\displaystyle x_{0},x_{1},\dots } is processed by a neural network (which can be an LSTM, a Transformer encoder, or some other network) into a sequence of real-valued vectors h 0 , h 1 , … {\displaystyle h_{0},h_{1},\dots } , where h {\displaystyle h} stands for "hidden vector". After the encoder has finished processing, the decoder starts operating over the hidden vectors, to produce an output sequence y 0 , y 1 , … {\displaystyle y_{0},y_{1},\dots } , autoregressively. That is, it always takes as input both the hidden vectors produced by the encoder, and what the decoder itself has produced before, to produce the next output word: ( h 0 , h 1 , … {\displaystyle h_{0},h_{1},\dots } , "") → "la" ( h 0 , h 1 , … {\displaystyle h_{0},h_{1},\dots } , " la") → "la zone" ( h 0 , h 1 , … {\displaystyle h_{0},h_{1},\dots } , " la zone") → "la zone de" ... ( h 0 , h 1 , … {\displaystyle h_{0},h_{1},\dots } , " la zone de contrôle international") → "la zone de contrôle international " Here, we use the special token as a control character to delimit the start of input for the decoder. The decoding terminates as soon as "" appears in the decoder output. ==

    Read more →
  • DryvIQ

    DryvIQ

    DryvIQ is a software application that enables businesses to migrate on-site system files and associated data across storage and content management platforms, as well as create synchronized hybrid storage systems. == History == Before it was DryvIQ, the software SkySync was released in 2013 by Ann Arbor, Michigan based company, Portal Architects, Inc. The company created SkySync, a back-end, administrative application designed to transfer content across storage platforms, after abandoning 18 months of development on a desktop application called SkyBrary in 2011. Between 2014 and 2015, Portal Architects established partnerships with the following companies: Autodesk, Box, Dropbox, Egnyte, EMC, Google, Syncplicity, Huddle, IBM, Microsoft, OpenText, Oracle, Citrix ShareFile, Hightail and Internet2. SkySync (currently DryvIQ) was named a "Cool Vendor in Content Management" by Gartner in 2015. In 2022, SkySync changed its name to DryvIQ, which is now what the company is currently known as. == Overview == DryvIQ is a software application that syncs, migrates or backs up files including their associated properties, metadata, versions, user accounts and permissions across on-premises and Cloud-based storage platforms. The software deploys on a server, virtual machine or within Microsoft Azure, Amazon Web Services or other cloud computing services.

    Read more →
  • Convolutional neural network

    Convolutional neural network

    A convolutional neural network (CNN) is a type of feedforward neural network that learns features via filter (or kernel) optimization. This type of deep learning network has been applied to process and make predictions from many different types of data including text, images and audio. CNNs are the de-facto standard in deep learning-based approaches to computer vision and image processing, and have only recently been replaced—in some cases—by newer architectures such as the transformer. Vanishing gradients and exploding gradients, seen during backpropagation in earlier neural networks, are prevented by the regularization that comes from using shared weights over fewer connections. For example, for each neuron in the fully-connected layer, 10,000 weights would be required for processing an image sized 100 × 100 pixels. However, applying cascaded convolution (or cross-correlation) kernels, only 25 weights for each convolutional layer are required to process 5x5-sized tiles. Higher-layer features are extracted from wider context windows, compared to lower-layer features. Some applications of CNNs include: image and video recognition, recommender systems, image classification, image segmentation, medical image analysis, natural language processing, brain–computer interfaces, and financial time series. CNNs are also known as shift invariant or space invariant artificial neural networks, based on the shared-weight architecture of the convolution kernels or filters that slide along input features and provide translation-equivariant responses known as feature maps. Counter-intuitively, most convolutional neural networks are not invariant to translation, due to the downsampling operation they apply to the input. Feedforward neural networks are usually fully connected networks, that is, each neuron in one layer is connected to all neurons in the next layer. The "full connectivity" of these networks makes them prone to overfitting data. Typical ways of regularization, or preventing overfitting, include: penalizing parameters during training (such as weight decay) or trimming connectivity (skipped connections, dropout, etc.) Robust datasets also increase the probability that CNNs will learn the generalized principles that characterize a given dataset rather than the biases of a poorly-populated set. Convolutional networks were inspired by biological processes in that the connectivity pattern between neurons resembles the organization of the animal visual cortex. Individual cortical neurons respond to stimuli only in a restricted region of the visual field known as the receptive field. The receptive fields of different neurons partially overlap such that they cover the entire visual field. CNNs use relatively little pre-processing compared to other image classification algorithms. This means that the network learns to optimize the filters (or kernels) through automated learning, whereas in traditional algorithms these filters are hand-engineered. This simplifies and automates the process, enhancing efficiency and scalability overcoming human-intervention bottlenecks. == Architecture == A convolutional neural network consists of an input layer, hidden layers and an output layer. In a convolutional neural network, the hidden layers include one or more layers that perform convolutions. Typically this includes a layer that performs a dot product of the convolution kernel with the layer's input matrix. This product is usually the Frobenius inner product, and its activation function is commonly ReLU. As the convolution kernel slides along the input matrix for the layer, the convolution operation generates a feature map, which in turn contributes to the input of the next layer. This is followed by other layers such as pooling layers, fully connected layers, and normalization layers. Here it should be noted how close a convolutional neural network is to a matched filter. === Convolutional layers === In a CNN, the input is a tensor with shape: (number of inputs) × (input height) × (input width) × (input channels) After passing through a convolutional layer, the image becomes abstracted to a feature map, also called an activation map, with shape: (number of inputs) × (feature map height) × (feature map width) × (feature map channels). Convolutional layers convolve the input and pass its result to the next layer. This is similar to the response of a neuron in the visual cortex to a specific stimulus. Each convolutional neuron processes data only for its receptive field. Although fully connected feedforward neural networks can be used to learn features and classify data, this architecture is generally impractical for larger inputs (e.g., high-resolution images), which would require massive numbers of neurons because each pixel is a relevant input feature. A fully connected layer for an image of size 100 × 100 has 10,000 weights for each neuron in the second layer. Convolution reduces the number of free parameters, allowing the network to be deeper. For example, using a 5 × 5 tiling region, each with the same shared weights, requires only 25 neurons. Using shared weights means there are many fewer parameters, which helps avoid the vanishing gradients and exploding gradients problems seen during backpropagation in earlier neural networks. To speed processing, standard convolutional layers can be replaced by depthwise separable convolutional layers, which are based on a depthwise convolution followed by a pointwise convolution. The depthwise convolution is a spatial convolution applied independently over each channel of the input tensor, while the pointwise convolution is a standard convolution restricted to the use of 1 × 1 {\displaystyle 1\times 1} kernels. === Pooling layers === Convolutional networks may include local and/or global pooling layers along with traditional convolutional layers. Pooling layers reduce the dimensions of data by combining the outputs of neuron clusters at one layer into a single neuron in the next layer. Local pooling combines small clusters, tiling sizes such as 2 × 2 are commonly used. Global pooling acts on all the neurons of the feature map. There are two common types of pooling in popular use: max and average. Max pooling uses the maximum value of each local cluster of neurons in the feature map, while average pooling takes the average value. === Fully connected layers === Fully connected layers connect every neuron in one layer to every neuron in another layer. It is the same as a traditional multilayer perceptron neural network (MLP). Each neuron in the fully connected layer receives input from all the neurons in the previous layer. These inputs are weighted and summed with the corresponding biases, and then passed through an activation function to perform a nonlinear transformation, generating the output. The flattened matrix goes through a fully connected layer to classify the images. === Receptive field === In neural networks, each neuron receives input from some number of locations in the previous layer. In a convolutional layer, each neuron receives input from only a restricted area of the previous layer called the neuron's receptive field. Typically the area is a square (e.g. 5 by 5 neurons). Whereas, in a fully connected layer, the receptive field is the entire previous layer. Thus, in each convolutional layer, each neuron takes input from a larger area in the input than previous layers. This is due to applying the convolution over and over, which takes the value of a pixel into account, as well as its surrounding pixels. When using dilated layers, the number of pixels in the receptive field remains constant, but the field is more sparsely populated as its dimensions grow when combining the effect of several layers. To manipulate the receptive field size as desired, there are some alternatives to the standard convolutional layer. For example, atrous or dilated convolution expands the receptive field size without increasing the number of parameters by interleaving visible and blind regions. Moreover, a single dilated convolutional layer can comprise filters with multiple dilation ratios, thus having a variable receptive field size. === Weights === Each neuron in a neural network computes an output value by applying a specific function to the input values received from the receptive field in the previous layer. The function that is applied to the input values is determined by a vector of weights and a bias (typically real numbers). Learning consists of iteratively adjusting these biases and weights. The vectors of weights and biases are called filters and represent particular features of the input (e.g., a particular shape). A distinguishing feature of CNNs is that many neurons can share the same filter. This reduces the memory footprint because a single bias and a single vector of weights are used across all receptive fields that share that filter, as opposed to each receptive field having its own bias and vector

    Read more →
  • T-pose

    T-pose

    In computer animation, a T-pose is a default posing for a humanoid 3D model's skeleton before it is animated. It is called so because of its shape: the straight legs and arms of a humanoid model combine to form a capital letter T. When the arms are angled downwards, the pose is sometimes referred to as an A-pose instead. Likewise, if the arms are angled upward, it is called a Y-pose. Generic terms encompassing all these (especially for non-humanoid models) include bind pose, blind pose, and reference pose. == Usage == The T-pose is primarily used as the default armature pose for skeletal animation in 3D software, which is then manipulated to create animation. The purpose of the T-pose relates to the important elements of the body being axis-aligned, thereby making it easier to rig the model for animation, physics, and other controls. Depending on the exact geometry of the model, other poses such as the A-pose may be more suitable for vertex deformation around areas such as the shoulders. Outside of being default poses in animation software, T-poses are typically used as placeholders for animation not yet completed, particularly in 3D animated video games. In some motion capture software, a T-pose must be assumed by the actor in the motion capture suit before motion capturing can begin. There are other poses used, but the T-pose is the most common one. == As an Internet meme == Starting in 2016 and resurfacing in 2017, the T-pose has become a widespread Internet meme due to its bizarre and somewhat comedic appearance, especially in video game glitches where a character's animation is unexpectedly supplanted by a T-pose. In a prerelease video of the game NBA Elite 11, the demo was filled with glitches, notably one unintentionally showing a T-pose in place of the proper animation for the model of player Andrew Bynum. The glitch later gained fame as the "Jesus Bynum glitch". Publisher EA eventually cancelled the game as they found it unsatisfactory. A similar occurrence happened with Cyberpunk 2077. In the 2023 Formula One season, driver George Russell performed a T-pose in the opening credits of the series' TV broadcasts. This quickly became a meme within the motorsports community. Russell repeated the pose after claiming pole position at the 2024 Canadian Grand Prix and winning the 2024 Austrian Grand Prix.

    Read more →
  • Robot Monk Xian'er

    Robot Monk Xian'er

    Robot Monk Xian'er (Chinese: 贤二机器僧) is a humanoid robot based on the cartoon character Xian'er. It was developed by a team of monks, volunteers and AI experts from Beijing Longquan Monastery in Beijing, China. He can follow human instructions to make body movements, read scriptures and play Buddhist music. He can chat and respond to people's emotional and spiritual questions with Buddhist wisdom. As a chatbot, Robot Monk Xian'er is available on certain public platforms including WeChat and Facebook. Over the years, master Xuecheng, the abbot of Beijing Longquan Monastery, replied to thousands of questions on Sina Weibo. These questions and their answers become the data source of the chatbot.

    Read more →
  • Machine vision

    Machine vision

    Machine vision is the technology and methods used to provide imaging-based automatic inspection and analysis for such applications as automatic inspection, process control, and robot guidance, usually in industry. Machine vision refers to many technologies, software and hardware products, integrated systems, actions, methods and expertise. Machine vision as a systems engineering discipline can be considered distinct from computer vision, a form of computer science. It attempts to integrate existing technologies in new ways and apply them to solve real world problems. The term is the prevalent one for these functions in industrial automation environments but is also used for these functions in other environment vehicle guidance. The overall machine vision process includes planning the details of the requirements and project, and then creating a solution. During run-time, the process starts with imaging, followed by automated analysis of the image and extraction of the required information. == Definition == Definitions of the term "Machine vision" vary, but all include the technology and methods used to extract information from an image on an automated basis, as opposed to image processing, where the output is another image. The information extracted can be a simple good-part/bad-part signal, or more a complex set of data such as the identity, position and orientation of each object in an image. The information can be used for such applications as automatic inspection and robot and process guidance in industry, for security monitoring and vehicle guidance. This field encompasses a large number of technologies, software and hardware products, integrated systems, actions, methods and expertise. Machine vision is practically the only term used for these functions in industrial automation applications; the term is less universal for these functions in other environments such as security and vehicle guidance. Machine vision as a systems engineering discipline can be considered distinct from computer vision, a form of basic computer science; machine vision attempts to integrate existing technologies in new ways and apply them to solve real world problems in a way that meets the requirements of industrial automation and similar application areas. The term is also used in a broader sense by trade shows and trade groups such as the Automated Imaging Association and the European Machine Vision Association. This broader definition also encompasses products and applications most often associated with image processing. The primary uses for machine vision are automatic inspection and industrial robot/process guidance. In more recent times the terms computer vision and machine vision have converged to a greater degree. See glossary of machine vision. == Imaging based automatic inspection and sorting == The primary uses for machine vision are imaging-based automatic inspection and sorting and robot guidance.; in this section the former is abbreviated as "automatic inspection". The overall process includes planning the details of the requirements and project, and then creating a solution. This section describes the technical process that occurs during the operation of the solution. === Methods and sequence of operation === The first step in the automatic inspection sequence of operation is acquisition of an image, typically using cameras, lenses, and lighting that has been designed to provide the differentiation required by subsequent processing. MV software packages and programs developed in them then employ various digital image processing techniques to extract the required information, and often make decisions (such as pass/fail) based on the extracted information. === Equipment === The components of an automatic inspection system usually include lighting, a camera or other imager, a processor, software, and output devices. === Imaging === The imaging device (e.g. camera) can either be separate from the main image processing unit or combined with it in which case the combination is generally called a smart camera or smart sensor. Inclusion of the full processing function into the same enclosure as the camera is often referred to as embedded processing. When separated, the connection may be made to specialized intermediate hardware, a custom processing appliance, or a frame grabber within a computer using either an analog or standardized digital interface (Camera Link, CoaXPress). MV implementations also use digital cameras capable of direct connections (without a framegrabber) to a computer via FireWire, USB or Gigabit Ethernet interfaces. While conventional (2D visible light) imaging is most commonly used in MV, alternatives include multispectral imaging, hyperspectral imaging, imaging various infrared bands, line scan imaging, 3D imaging of surfaces and X-ray imaging. Key differentiations within MV 2D visible light imaging are monochromatic vs. color, frame rate, resolution, and whether or not the imaging process is simultaneous over the entire image, making it suitable for moving processes. Though the vast majority of machine vision applications are solved using two-dimensional imaging, machine vision applications utilizing 3D imaging are a growing niche within the industry. The most commonly used method for 3D imaging is scanning based triangulation which utilizes motion of the product or image during the imaging process. A laser is projected onto the surfaces of an object. In machine vision this is accomplished with a scanning motion, either by moving the workpiece, or by moving the camera & laser imaging system. The line is viewed by a camera from a different angle; the deviation of the line represents shape variations. Lines from multiple scans are assembled into a depth map or point cloud. Stereoscopic vision is used in special cases involving unique features present in both views of a pair of cameras. Other 3D methods used for machine vision are time of flight and grid based. One method is grid array based systems using pseudorandom structured light system as employed by the Microsoft Kinect system circa 2012. === Image processing === After an image is acquired, it is processed. Central processing functions are generally done by a CPU, a GPU, a FPGA or a combination of these. Deep learning training and inference impose higher processing performance requirements. Multiple stages of processing are generally used in a sequence that ends up as a desired result. A typical sequence might start with tools such as filters which modify the image, followed by extraction of objects, then extraction (e.g. measurements, reading of codes) of data from those objects, followed by communicating that data, or comparing it against target values to create and communicate "pass/fail" results. Machine vision image processing methods include; Stitching/Registration: Combining of adjacent 2D or 3D images. Filtering (e.g. morphological filtering) Thresholding: Thresholding starts with setting or determining a gray value that will be useful for the following steps. The value is then used to separate portions of the image, and sometimes to transform each portion of the image to simply black and white based on whether it is below or above that grayscale value. Pixel counting: counts the number of light or dark pixels Segmentation: Partitioning a digital image into multiple segments to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze. Edge detection: finding object edges Color Analysis: Identify parts, products and items using color, assess quality from color, and isolate features using color. Blob detection and extraction: inspecting an image for discrete blobs of connected pixels (e.g. a black hole in a grey object) as image landmarks. Neural network / deep learning / machine learning processing: weighted and self-training multi-variable decision making Circa 2019 there is a large expansion of this, using deep learning and machine learning to significantly expand machine vision capabilities. The most common result of such processing is classification. Examples of classification are object identification,"pass fail" classification of identified objects and OCR. Pattern recognition including template matching. Finding, matching, and/or counting specific patterns. This may include location of an object that may be rotated, partially hidden by another object, or varying in size. Barcode, Data Matrix and "2D barcode" reading Optical character recognition: automated reading of text such as serial numbers Gauging/Metrology: measurement of object dimensions (e.g. in pixels, inches or millimeters) Comparison against target values to determine a "pass or fail" or "go/no go" result. For example, with code or bar code verification, the read value is compared to the stored target value. For gauging, a measurement is compared against the proper value and tolerances. For verification of alpha-numberic codes, the

    Read more →