AI For Np Students

AI For Np Students — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Question answering

    Question answering

    Question answering (QA) is a computer science discipline within the fields of information retrieval and natural language processing (NLP) that is concerned with building systems that automatically answer questions that are posed by humans in a natural language. A question-answering implementation, usually a computer program, may construct its answers by querying a structured database of knowledge or information, usually a knowledge base. More commonly, question-answering systems can pull answers from an unstructured collection of natural language documents. Some examples of natural language document collections used for question answering systems include reference texts, compiled newswire reports, Wikipedia pages and other World Wide Web pages. == History == Two early question answering systems were BASEBALL and LUNAR. BASEBALL answered questions about Major League Baseball over a period of one year. LUNAR answered questions about the geological analysis of rocks returned by the Apollo Moon missions. Both question answering systems were very effective in their chosen domains. LUNAR was demonstrated at a lunar science convention in 1971 and it was able to answer 90% of the questions in its domain that were posed by people untrained on the system. Further restricted-domain question answering systems were developed in the following years. The common feature of all these systems is that they had a core database or knowledge system that was hand-written by experts of the chosen domain. The language abilities of BASEBALL and LUNAR used techniques similar to ELIZA and DOCTOR, the first chatterbot programs. SHRDLU was a successful question-answering program developed by Terry Winograd in the late 1960s and early 1970s. It simulated the operation of a robot in a toy world (the "blocks world"), and it offered the possibility of asking the robot questions about the state of the world. The strength of this system was the choice of a very specific domain and a very simple world with rules of physics that were easy to encode in a computer program. In the 1970s, knowledge bases were developed that targeted narrower domains of knowledge. The question answering systems developed to interface with these expert systems produced more repeatable and valid responses to questions within an area of knowledge. These expert systems closely resembled modern question answering systems except in their internal architecture. Expert systems rely heavily on expert-constructed and organized knowledge bases, whereas many modern question answering systems rely on statistical processing of a large, unstructured, natural language text corpus. The 1970s and 1980s saw the development of comprehensive theories in computational linguistics, which led to the development of ambitious projects in text comprehension and question answering. One example was the Unix Consultant (UC), developed by Robert Wilensky at U.C. Berkeley in the late 1980s. The system answered questions pertaining to the Unix operating system. It had a comprehensive, hand-crafted knowledge base of its domain, and it aimed at phrasing the answer to accommodate various types of users. Another project was LILOG, a text-understanding system that operated on the domain of tourism information in a German city. The systems developed in the UC and LILOG projects never went past the stage of simple demonstrations, but they helped the development of theories on computational linguistics and reasoning. Specialized natural-language question answering systems have been developed, such as EAGLi for health and life scientists. Question answering systems have been extended in recent years to encompass additional domains of knowledge For example, systems have been developed to automatically answer temporal and geospatial questions, questions of definition and terminology, biographical questions, multilingual questions, and questions about the content of audio, images, and video. Current question answering research topics include: interactivity—clarification of questions or answers answer reuse or caching semantic parsing answer presentation knowledge representation and semantic entailment social media analysis with question answering systems sentiment analysis utilization of thematic roles Image captioning for visual question answering Embodied question answering In 2011, Watson, a question answering computer system developed by IBM, competed in two exhibition matches of Jeopardy! against Brad Rutter and Ken Jennings, winning by a significant margin. Facebook Research made their DrQA system available under an open source license. This system uses Wikipedia as knowledge source. The open source framework Haystack by deepset combines open-domain question answering with generative question answering and supports the domain adaptation of the underlying language models for industry use cases. Large Language Models (LLMs)[36] like GPT-4[37], Gemini[38] are examples of successful QA systems that are enabling more sophisticated understanding and generation of text. When coupled with Multimodal[39] QA Systems, which can process and understand information from various modalities like text, images, and audio, LLMs significantly improve the capabilities of QA systems. == Types == Question-answering research attempts to develop ways of answering a wide range of question types, including fact, list, definition, how, why, hypothetical, semantically constrained, and cross-lingual questions. Answering questions related to an article in order to evaluate reading comprehension is one of the simpler form of question answering, since a given article is relatively short compared to the domains of other types of question-answering problems. An example of such a question is "What did Albert Einstein win the Nobel Prize for?" after an article about this subject is given to the system. Closed-book question answering is when a system has memorized some facts during training and can answer questions without explicitly being given a context. This is similar to humans taking closed-book exams. Closed-domain question answering deals with questions under a specific domain (for example, medicine or automotive maintenance) and can exploit domain-specific knowledge frequently formalized in ontologies. Alternatively, "closed-domain" might refer to a situation where only a limited type of questions are accepted, such as questions asking for descriptive rather than procedural information. Question answering systems in the context of machine reading applications have also been constructed in the medical domain, for instance related to Alzheimer's disease. Open-domain question answering deals with questions about nearly anything and can only rely on general ontologies and world knowledge. Systems designed for open-domain question answering usually have much more data available from which to extract the answer. An example of an open-domain question is "What did Albert Einstein win the Nobel Prize for?" while no article about this subject is given to the system. Another way to categorize question-answering systems is by the technical approach used. There are a number of different types of QA systems, including: rule-based systems, statistical systems, and hybrid systems. Rule-based systems use a set of rules to determine the correct answer to a question. Statistical systems use statistical methods to find the most likely answer to a question. Hybrid systems use a combination of rule-based and statistical methods. == Architecture == As of 2001, question-answering systems typically included a question classifier module that determined the type of question and the type of answer. Different types of question-answering systems employ different architectures. For example, modern open-domain question answering systems may use a retriever-reader architecture. The retriever is aimed at retrieving relevant documents related to a given question, while the reader is used to infer the answer from the retrieved documents. Systems such as GPT-3, T5, and BART use an end-to-end architecture in which a transformer-based architecture stores large-scale textual data in the underlying parameters. Such models can answer questions without accessing any external knowledge sources. == Methods == Question answering is dependent on a good search corpus; without documents containing the answer, there is little any question answering system can do. Larger collections generally mean better question answering performance, unless the question domain is orthogonal to the collection. Data redundancy in massive collections, such as the web, means that nuggets of information are likely to be phrased in many different ways in differing contexts and documents, leading to two benefits: If the right information appears in many forms, the question answering system needs to perform fewer complex NLP techniques to understand the text. Correct answers can be filtered from false positives because the syst

    Read more →
  • The Future of Work and Death

    The Future of Work and Death

    The Future of Work and Death is a 2016 documentary by Sean Blacknell and Wayne Walsh about the exponential growth of technology. The film showed at several film festivals including Raindance Film Festival, International Film Festival Rotterdam, Academia Film Olomouc and CPH:DOX. In May 2017 it received an official screening at the European Commission. It was distributed by First Run Features and Journeyman Pictures and was released on iTunes, Amazon Prime and On-demand on 9 May 2017. The film was made available on Sundance Now on 27 November 2017. A companion piece to the film, The Cost of Living, a documentary concerning universal basic income in Britain, was released on Amazon Prime on 8 October 2020. == Synopsis == World experts in the fields of futurology, anthropology, neuroscience, and philosophy consider the impact of technological advances on the two 'certainties' of human life; work and death. Charting human developments from Homo habilis, past the Industrial Revolution, to the digital age and beyond, the film looks at the shocking exponential rate at which mankind has managed to create technologies to ease the process of living. As we embark on the next phase of our adaptation, with automation and artificial intelligence signifying the complete move from man to machine, the film asks what the implications are for human fulfilment in an approaching era of job obsolescence and extreme longevity. == Cast == Dudley Sutton – Narrator Aubrey de Grey – Biomedical gerontologist and CSO of the SENS Research Foundation Will Self – Writer, journalist, political commentator and Professor of Contemporary Thought at Brunel University Rudolph E. Tanzi – Professor of Neurology at Harvard University and Director of the Genetics and Aging Research Unit at Massachusetts General Hospital (MGH) Martin Ford – Futurist and author Steve Fuller – Auguste Comte Chair in Social Epistemology at the Department of sociology at University of Warwick Murray Shanahan – Professor of Cognitive Robotics at Imperial College London Gray Scott – Futurist, executive producer of this production Vivek Wadhwa – Entrepreneur, academic and Director of Research at the Center for Entrepreneurship and Research Commercialization at the Pratt School of Engineering, Duke University Zoltan Istvan – Transhumanist and journalist Joanna Cook – Anthropologist, University College London Nicholas Kamara – Physician, Kable Hospital David Pearce – Transhumanist philosopher and co-founder of Humanity+ Peter Cochrane – Futurist and entrepreneur John Harris – Bioethicist, philosopher and Director of the Institute for Science, Ethics and Innovation at the University of Manchester Riva Melissa-Tez – Entrepreneur and transhumanist Ian Pearson – Futurologist Stuart Armstrong – Artificial intelligence researcher at Future of Humanity Institute

    Read more →
  • Reverse correlation technique

    Reverse correlation technique

    The reverse correlation technique is a data driven study method used primarily in psychological and neurophysiological research. This method earned its name from its origins in neurophysiology, where cross-correlations between white noise stimuli and sparsely occurring neuronal spikes could be computed quicker when only computing it for segments preceding the spikes. The term has since been adopted in psychological experiments that usually do not analyze the temporal dimension, but also present noise to human participants. In contrast to the original meaning, the term is here thought to reflect that the standard psychological practice of presenting stimuli of defined categories to the participants is "reversed": Instead, the participant's mental representations of categories are estimated from interactions of the presented noise and the behavioral responses. It is used to create composite pictures of individual and/or group mental representations of various items (e.g. faces, bodies, and the self) that depict characteristics of said items (e.g. trustworthiness and self-body image). This technique is helpful when evaluating the mental representations of those with and without mental illnesses. == Terms == This technique utilizes spike-triggered average to explain what areas of signal and noise in an image are valuable for the given research question. Signal is information used to produce objects of value that help explain and connect the world around us. Noise is commonly referred to as unwanted signal that obscures the information that the signal is trying to present. Most importantly for reverse correlation studies, noise is randomly varying information. To determine the areas of importance using reverse correlation, noise is applied to a base image and then evaluated by observers. A base image is any image void of noise that relates to the research question. A base image that has noise superimposed on top is the stimuli that is presented to and evaluated by participants. Each time a new set of stimuli is presented to a participant, this is known as a trial. After a participant has responded to hundreds to thousands of trials, a researcher is ready to create a classification image. A classification image (abbreviated as "CI" in some studies) is a single image that represents the average noise patterns in the images selected by participants. A classification image can also be computed for groups by averaging the individuals’ classification images. These classification images are what researchers use to interpret the data and draw conclusions. As a whole, the reverse correlation method is a process that results in a composite image (from an individual or group) that can be used to estimate and interpret mental representations. == Basic study layout == The reverse correlation method is typically executed as an in-lab computer experiment. This method follows four broad steps. Each of the following steps are described in greater detail below. After creating a research question and determining that the reverse correlation method is the most suitable technique to answer the question, a researcher must (1) design randomly varying stimuli. After the stimuli have been prepared, a researcher should (2) collect data from participants who will see and respond to approximately 300 -1,000 trials. Each trial will either consist of one or two images (side by side) derived from the same base image with noise superimposed on top. Participant responses will depend on the chosen study design; if a researcher presents only one image at a time, participants rate the image on a 4pt scale, but when two images are shown, the participant is asked to choose which best aligns with the given category (e.g. choose the image that looks the most aggressive). Once all of the data is collected, the researcher will (3) compute classification images for each participant and using those images compute group classification images. Finally, with the classification images available, the researcher will (4) evaluate the images and draw conclusions about their results. === Step 1: making stimuli === When designing the stimuli for a reverse correlation study, the two primary factors that one should consider are (1) the base image and (2) the noise that will be used. While not all bases are images per se, the majority are and for this reason the base is typically referred to as a base image. The base image should represent whatever the research question is addressing. For example, if you are interested in peoples’ mental representations of Chinese people, it would not make sense to use a base image of a Spanish or Caucasian person. Again, if you are interested in the mental representations of male vocal patterns, it would make the most sense to use a base vocal pattern that has been produced by a male. Having a base is important because it provides a kind of anchor for participants to work from. When there is no base image, the number of trials that are required increases dramatically, thus making it harder to collect data. While there are studies that have excluded a base image, (e.g. the S study), for more elaborate and nuanced research questions, it is important to have a base image that is a fair representation of what participants are being asked to categorize. Photographs of faces are generally the most popular base image. Although the reverse correlation method is capable of investigating a wide variety of research questions, the most common application of the method is for evaluating faces on a single trait. Reverse correlation studies that address evaluations of the face are sometimes referred to as being a face space reverse correlation model (FSRCM). Thankfully, there are existing databases for face images of varying demographics and emotion that work well as base images. The reverse correlation method can also be used to help researchers identify what areas of an image (e.g. the areas on the face) have diagnostic value. In order to identify these areas of value, researchers start by minimizing the space a participant can pull information from. By imposing a “mask” on an image (e.g. blur an image while leaving random areas un-blurred), this reduces the information individuals might see, and forces them to focus on certain areas. Then, if/when participants are able to correctly identify an image with a trait repeatedly, we can draw conclusions about what areas have diagnostic value. While faces and visual stimuli are the most popular, this is not the only stimuli that can be used in a reverse correlation study. This method was originally designed for auditory stimuli which allows researchers to investigate how perceivers interpret auditory information and create trait based attributions to different sound patterns. For example, by segmenting a vocal recording of a single word (total sound time 426 ms) into six segments (71 ms each), and varying each segment's pitch using Gaussian distributions, researchers were able to uncover what vocal patterns people associated with certain traits. Specifically, this study investigated how listeners rated sound clips of the word “really” as sounding more interrogative (i.e. like the more common reverse correlation studies this study had participants listen to two sound clips per trial, choose which fit the category the best, and then created an average of the pitch contours). Beyond face and auditory perception, research utilizing the reverse correlation method has expanded to investigate how individuals see three-dimensional objects in images with noise (but no signal). After selecting your base image, regardless of what the image is, it is helpful to apply a Gaussian blur to smooth noise in the image. While noise will be applied later, it is helpful to reduce existing noise in the photo before applying your chosen noise. There are three primary choices when it comes to noise: white noise, sine-wave noise, and Gabor noise. The latter two of these constrain the configurations that the noise can have, and because of this white noise is usually the most commonly used. Regardless of the type of noise that is chosen, it is crucial that the noise randomly varies. === Step 2: data collection === Once the stimuli for the study has been developed, the researcher must make a few decisions before actually collecting the data. The researcher must come to a conclusion on how many stimuli will be presented at a time and how many trials the participants will see. In terms of stimuli presentation, a researcher can choose from either a 2-Image Forced Choice (2IFC) or a 4-Alternative Forced Choice (4AFC). The 2IFC presents two images at once (side by side) and requires participants to choose between the two on a specified category (e.g. which image looks the most like a male). Typically the noise from the left image is the mathematical inverse of the noise from the right image. This method was developed to better answer questions that could n

    Read more →
  • OpenPipeline

    OpenPipeline

    openPipeline is an open-source plug-in for Autodesk Maya that is designed to assist in a Production Pipeline structure and Computer animation. == Development == Created in Maya Embedded Language, openPipeline was initiated at Eyebeam Atelier and further developed at Pratt Institute in the Digital Arts Lab. The initial release date was December 28, 2006. == Contributors == Rob O'Neill (Creator) Paris Mavroidis Meng-Han Ho

    Read more →
  • Bag-of-words model

    Bag-of-words model

    The bag-of-words (BoW) model is a model of text which uses an unordered collection (a "bag") of words. It is used in natural language processing and information retrieval (IR). It disregards word order (and thus most of syntax or grammar) but captures multiplicity. The bag-of-words model is commonly used in methods of document classification where, for example, the (frequency of) occurrence of each word is used as a feature for training a classifier. It has also been used for computer vision. An early reference to "bag of words" in a linguistic context can be found in Zellig Harris's 1954 article on Distributional Structure. == Definition == The following models a text document using bag-of-words. Here are two simple text documents: Based on these two text documents, a list is constructed as follows for each document: Representing each bag-of-words as a JSON object, and attributing to the respective JavaScript variable: Each key is the word, and each value is the number of occurrences of that word in the given text document. The order of elements is free, so, for example {"too":1,"Mary":1,"movies":2,"John":1,"watch":1,"likes":2,"to":1} is also equivalent to BoW1. It is also what we expect from a strict JSON object representation. Note: if another document is like a union of these two, its JavaScript representation will be: So, as we see in the bag algebra, the "union" of two documents in the bags-of-words representation is, formally, the disjoint union, summing the multiplicities of each element. === Word order === The BoW representation of a text removes all word ordering. For example, the BoW representation of "man bites dog" and "dog bites man" are the same, so any algorithm that operates with a BoW representation of text must treat them in the same way. Despite this lack of syntax or grammar, BoW representation is fast and may be sufficient for simple tasks that do not require word order. For instance, for document classification, if the words "stocks" "trade" "investors" appears multiple times, then the text is likely a financial report, even though it would be insufficient to distinguish between Yesterday, investors were rallying, but today, they are retreating.andYesterday, investors were retreating, but today, they are rallying.and so the BoW representation would be insufficient to determine the detailed meaning of the document. == Implementations == Implementations of the bag-of-words model might involve using frequencies of words in a document to represent its contents. The frequencies can be "normalized" by the inverse of document frequency, or tf–idf. Additionally, for the specific purpose of classification, supervised alternatives have been developed to account for the class label of a document. Lastly, binary (presence/absence or 1/0) weighting is used in place of frequencies for some problems (e.g., this option is implemented in the WEKA machine learning software system). == Hashing trick == A common alternative to using dictionaries is the hashing trick, where words are mapped directly to indices with a hash function. When using a hash function, no memory is required to store a dictionary. In practice, hashing simplifies the implementation of bag-of-words models and improves scalability. Collisions can occur when two words are hashed to the same index, but this happens infrequently and may function as a form of regularization.

    Read more →
  • Network Abstraction Layer

    Network Abstraction Layer

    The Network Abstraction Layer (NAL) is a part of the H.264/AVC and HEVC video coding standards. The main goal of the NAL is the provision of a "network-friendly" video representation addressing "conversational" (video telephony) and "non conversational" (storage, broadcast, or streaming) applications. NAL has achieved a significant improvement in application flexibility relative to prior video coding standards. == Introduction == An increasing number of services and growing popularity of high definition TV are creating greater needs for higher coding efficiency. Moreover, other transmission media such as cable modem, xDSL, or UMTS offer much lower data rates than broadcast channels, and enhanced coding efficiency can enable the transmission of more video channels or higher quality video representations within existing digital transmission capacities. Video coding for telecommunication applications has diversified from ISDN and T1/E1 service to embrace PSTN, mobile wireless networks, and LAN/Internet network delivery. Throughout this evolution, continued efforts have been made to maximize coding efficiency while dealing with the diversification of network types and their characteristic formatting and loss/error robustness requirements. The H.264/AVC and HEVC standards are designed for technical solutions including areas like broadcasting (over cable, satellite, cable modem, DSL, terrestrial, etc.) interactive or serial storage on optical and magnetic devices, conversational services, video-on-demand or multimedia streaming, multimedia messaging services, etc. Moreover, new applications may be deployed over existing and future networks. This raises the question about how to handle this variety of applications and networks. To address this need for flexibility and customizability, the design covers a NAL that formats the Video Coding Layer (VCL) representation of the video and provides header information in a manner appropriate for conveyance by a variety of transport layers or storage media. The NAL is designed in order to provide "network friendliness" to enable simple and effective customization of the use of VCL for a broad variety of systems. The NAL facilitates the ability to map VCL data to transport layers such as: RTP/IP for any kind of real-time wire-line and wireless Internet services. File formats, e.g., ISO MP4 for storage and MMS. H.32X for wireline and wireless conversational services. MPEG-2 systems for broadcasting services, etc. The full degree of customization of the video content to fit the needs of each particular application is outside the scope of the video coding standardization effort, but the design of the NAL anticipates a variety of such mappings. Some key concepts of the NAL are NAL units, byte stream, and packet formats uses of NAL units, parameter sets, and access units. A short description of these concepts is given below. == NAL units == The coded video data is organized into NAL units, each of which is effectively a packet that contains an integer number of bytes. The first byte of each H.264/AVC NAL unit is a header byte that contains an indication of the type of data in the NAL unit. For HEVC the header was extended to two bytes. All the remaining bytes contain payload data of the type indicated by the header. The NAL unit structure definition specifies a generic format for use in both packet-oriented and bitstream-oriented transport systems, and a series of NAL units generated by an encoder is referred to as a NAL unit stream. == NAL Units in Byte-Stream Format Use == Some systems require delivery of the entire or partial NAL unit stream as an ordered stream of bytes or bits within which the locations of NAL unit boundaries need to be identifiable from patterns within the coded data itself. For use in such systems, the H.264/AVC and HEVC specifications define a byte stream format. In the byte stream format, each NAL unit is prefixed by a specific pattern of three bytes called a start code prefix. The boundaries of the NAL unit can then be identified by searching the coded data for the unique start code prefix pattern. The use of emulation prevention bytes guarantees that start code prefixes are unique identifiers of the start of a new NAL unit. A small amount of additional data (one byte per video picture) is also added to allow decoders that operate in systems that provide streams of bits without alignment to byte boundaries to recover the necessary alignment from the data in the stream. Additional data can also be inserted in the byte stream format that allows expansion of the amount of data to be sent and can aid in achieving more rapid byte alignment recovery, if desired. == NAL Units in Packet-Transport System Use == In other systems (e.g., IP/RTP systems), the coded data is carried in packets that are framed by the system transport protocol, and identification of the boundaries of NAL units within the packets can be established without use of start code prefix patterns. In such systems, the inclusion of start code prefixes in the data would be a waste of data carrying capacity, so instead the NAL units can be carried in data packets without start code prefixes. == VCL and Non-VCL NAL Units == NAL units are classified into VCL and non-VCL NAL units. VCL NAL units contain the data that represents the values of the samples in the video pictures. Non-VCL NAL units contain any associated additional information such as parameter sets (important header data that can apply to a large number of VCL NAL units) and supplemental enhancement information (timing information and other supplemental data that may enhance usability of the decoded video signal but are not necessary for decoding the values of the samples in the video pictures). == Parameter Sets == A parameter set contains shared configuration data that is carried in non-VCL NAL units. Parameter sets are typically reused when decoding many coded pictures within a video sequence. Each VCL NAL unit references a picture parameter set (PPS), which in turn references a sequence parameter set (SPS). There are two types of parameter sets: Sequence parameter set (SPS), which specifies mostly constant configuration such as resolution, bit depth, or chroma format. (For a concrete implementation, see FFmpeg's SPS struct.) Picture parameter set (PPS), which applies on top of an SPS, and specifies configuration such as QP offsets. (For a concrete implementation, see FFmpeg's PPS struct.) The sequence and picture parameter-set mechanism decouples the transmission of infrequently changing information from the transmission of coded representations of the values of the samples in the video pictures. Each VCL NAL unit contains an identifier that refers to the content of the relevant picture parameter set and each picture parameter set contains an identifier that refers to the content of the relevant sequence parameter set. In this manner, a small amount of data (the identifier) can be used to refer to a larger amount of information (the parameter set) without repeating that information within each VCL NAL unit. Sequence and picture parameter sets can be sent well ahead of the VCL NAL units that they apply to, and can be repeated to provide robustness against data loss. In some applications, parameter sets may be sent within the channel that carries the VCL NAL units (termed "in-band" transmission). In other applications, it can be advantageous to convey the parameter sets "out-of-band" using a more reliable transport mechanism than the video channel itself. == Access Units == A set of NAL units in a specified form is referred to as an access unit. The decoding of each access unit results in one decoded picture. Each access unit contains a set of VCL NAL units that together compose a primary coded picture. It may also be prefixed with an access unit delimiter to aid in locating the start of the access unit. Some supplemental enhancement information containing data such as picture timing information may also precede the primary coded picture. The primary coded picture consists of a set of VCL NAL units consisting of slices or slice data partitions that represent the samples of the video picture. Following the primary coded picture may be some additional VCL NAL units that contain redundant representations of areas of the same video picture. These are referred to as redundant coded pictures, and are available for use by a decoder in recovering from loss or corruption of the data in the primary coded pictures. Decoders are not required to decode redundant coded pictures if they are present. Finally, if the coded picture is the last picture of a coded video sequence (a sequence of pictures that is independently decodable and uses only one sequence parameter set), an end of sequence NAL unit may be present to indicate the end of the sequence; and if the coded picture is the last coded picture in the entire NAL unit stream, an end of stream NAL unit may be present to

    Read more →
  • Speech recognition

    Speech recognition

    Speech recognition (automatic speech recognition (ASR), computer speech recognition, or speech-to-text (STT)) is a sub-field of computational linguistics concerned with methods and technologies that translate spoken language into text or other interpretable forms. Speech recognition applications include voice user interfaces, where the user speaks to a device, which "listens" and processes the audio. Common voice applications include interpreting commands for calling, call routing, home automation, and aircraft control. These applications are called direct voice input. Productivity applications include searching audio recordings, creating transcripts, and dictation. Speech recognition can be used to analyse speaker characteristics, such as identifying native language using pronunciation assessment. Voice recognition (speaker identification) refers to identifying the speaker, rather than speech contents. Recognizing the speaker can simplify the task of translating speech in systems trained on a specific person's voice. It can also be used to authenticate the speaker as part of a security process. == History == Applications for speech recognition developed over many decades, with progress accelerated due to advances in deep learning and the use of big data. These advances are reflected in an increase in academic papers, and greater system adoption. Key areas of growth include vocabulary size, more accurate recognition for unfamiliar speakers (speaker independence), and faster processing speed. === Pre-1970 === 1952 – Bell Labs researchers, Stephen Balashek, R. Biddulph, and K. H. Davis, built Audrey for single-speaker digit recognition. Their system located the formants in the power spectrum of each utterance. 1960 – Gunnar Fant developed and published the source–filter model of speech production. 1962 – IBM's 16-word "Shoebox" machine's speech recognition debuted at the 1962 World's Fair. 1966 – Linear predictive coding, a speech coding method, was proposed by Fumitada Itakura of Nagoya University and Shuzo Saito of Nippon Telegraph and Telephone. 1969 – Funding at Bell Labs came to a halt for several years after the company's head engineer, John R. Pierce, wrote an open letter criticizing speech recognition research. This defunding lasted until Pierce retired and James L. Flanagan took over. Raj Reddy was the first person to work on continuous speech recognition, as a graduate student at Stanford University in the late 1960s. Previous systems required users to pause after each word. Reddy's system issued spoken commands for playing chess. Around this time, Soviet researchers invented the dynamic time warping (DTW) algorithm and used it to create a recognizer capable of operating on a 200-word vocabulary. DTW processed speech by dividing it into short frames (e.g. 10 ms segments) and treating each frame as a unit. Speaker independence, however, remained unsolved. === 1970–1990 === 1971 – DARPA funded a five-year speech recognition research project, Speech Understanding Research, seeking a minimum vocabulary size of 1,000 words. The project considered speech understanding a key to achieving progress in speech recognition, which was later disproved. BBN, IBM, Carnegie Mellon (CMU), and Stanford Research Institute participated. 1972 – The IEEE Acoustics, Speech, and Signal Processing group held a conference in Newton, Massachusetts. 1976 – The first ICASSP was held in Philadelphia, which became a major venue for publishing on speech recognition. During the late 1960s, Leonard Baum developed the mathematics of Markov chains at the Institute for Defense Analysis. A decade later, at CMU, Raj Reddy's students James Baker and Janet M. Baker began using the hidden Markov model (HMM) for speech recognition. James Baker had learned about HMMs while at the Institute for Defense Analysis. HMMs enabled researchers to combine sources of knowledge, such as acoustics, language, and syntax, in a unified probabilistic model. By the mid-1980s, Fred Jelinek's team at IBM created a voice-activated typewriter called Tangora, which could handle a 20,000-word vocabulary. Jelinek's statistical approach placed less emphasis on emulating human brain processes in favor of statistical modelling. (Jelinek's group independently discovered the application of HMMs to speech.) This was controversial among linguists since HMMs are too simplistic to account for many features of human languages. However, the HMM proved to be a highly useful way for modelling speech and replaced dynamic time warping as the dominant speech recognition algorithm in the 1980s. 1982 – Dragon Systems, founded by James and Janet M. Baker, was one of IBM's few competitors. === Practical speech recognition === The 1980s also saw the introduction of the n-gram language model. 1987 – The back-off model enabled language models to use multiple-length n-grams, and CSELT used HMM to recognize languages (in software and hardware, e.g. RIPAC). At the end of the DARPA program in 1976, the best computer available to researchers was the PDP-10 with 4 MB of RAM. It could take up to 100 minutes to decode 30 seconds of speech. Practical products included: 1984 – the Apricot Portable was released with up to 4096 words support, of which only 64 could be held in RAM at a time. 1987 – a recognizer from Kurzweil Applied Intelligence 1990 – Dragon Dictate, a consumer product released in 1990. AT&T deployed the Voice Recognition Call Processing service in 1992 to route telephone calls without a human operator. The technology was developed by Lawrence Rabiner and others at Bell Labs. By the early 1990s, the vocabulary of the typical commercial speech recognition system had exceeded the average human vocabulary. Reddy's former student, Xuedong Huang, developed the Sphinx-II system at CMU. Sphinx-II was the first to do speaker-independent, large vocabulary, continuous speech recognition, and it won DARPA's 1992 evaluation. Handling continuous speech with a large vocabulary was a major milestone. Huang later founded the speech recognition group at Microsoft in 1993. Reddy's student Kai-Fu Lee joined Apple, where, in 1992, he helped develop the Casper speech interface prototype. Lernout & Hauspie, a Belgium-based speech recognition company, acquired other companies, including Kurzweil Applied Intelligence in 1997 and Dragon Systems in 2000. L&H was used in Windows XP. L&H was an industry leader until an accounting scandal destroyed it in 2001. L&H speech technology was bought by ScanSoft, which became Nuance in 2005. Apple licensed Nuance software for its digital assistant Siri. ==== 2000s ==== In the 2000s, DARPA sponsored two speech recognition programs: Effective Affordable Reusable Speech-to-Text (EARS) in 2002, followed by Global Autonomous Language Exploitation (GALE) in 2005. Four teams participated in EARS: IBM; a team led by BBN with LIMSI and the University of Pittsburgh; Cambridge University; and a team composed of ICSI, SRI, and the University of Washington. EARS funded the collection of the Switchboard telephone speech corpus, which contained 260 hours of recorded conversations from over 500 speakers. The GALE program focused on Arabic and Mandarin broadcast news. Google's first effort at speech recognition came in 2007 after recruiting Nuance researchers. Its first product, GOOG-411, was a telephone-based directory service. Since at least 2006, the U.S. National Security Agency has employed keyword spotting, allowing analysts to index large volumes of recorded conversations and identify speech containing "interesting" keywords. Other government research programs focused on intelligence applications, such as DARPA's EARS program and IARPA's Babel program. In the early 2000s, speech recognition was dominated by hidden Markov models combined with feed-forward artificial neural networks (ANN). Later, speech recognition was taken over by long short-term memory (LSTM), a recurrent neural network (RNN) published by Sepp Hochreiter & Jürgen Schmidhuber in 1997. LSTM RNNs avoid the vanishing gradient problem and can learn "Very Deep Learning" tasks that require memories of events that happened thousands of discrete time steps earlier, which is important for speech. Around 2007, LSTMs trained with Connectionist Temporal Classification (CTC) began to outperform. In 2015, Google reported a 49 percent error‑rate reduction in its speech recognition via CTC‑trained LSTM. Transformers, a type of neural network based solely on attention, were adopted in computer vision and language modelling, and then to speech recognition. Deep feed-forward (non-recurrent) networks for acoustic modelling were introduced in 2009 by Geoffrey Hinton and his students at the University of Toronto, and by Li Deng and colleagues at Microsoft Research. In contrast to the prioer incremental improvements, deep learning decreased error rates by 30%. Both shallow and deep forms (e.g., recurrent nets) of ANNs had been explored since the 1980s. Howev

    Read more →
  • Video renderer

    Video renderer

    A video renderer is software that processes a video file and sends it sequentially to the video display controller card for display on a computer screen. An example of a video renderer, is the VMR-7 that was used by Microsoft's DirectShow. An example of a UNIX video renderer is the one container within GStreamer. Commonly used video renderers are: Enhanced Video Renderer VMR9 Renderless Haali's Video Renderer Madvr Video Renderer JRVR, a part of JRiver Media Center

    Read more →
  • Data preprocessing

    Data preprocessing

    Data preprocessing can refer to manipulation, filtration or augmentation of data before it is analyzed, and is often an important step in the data mining process. Data collection methods are often loosely controlled, resulting in out-of-range values, impossible data combinations, and missing values, amongst other issues. Preprocessing is the process by which unstructured data is transformed into intelligible representations suitable for machine-learning models. This phase of model deals with noise in order to arrive at better and improved results from the original data set which was noisy. This dataset also has some level of missing value present in it. The preprocessing pipeline used can often have large effects on the conclusions drawn from the downstream analysis. Thus, representation and quality of data is necessary before running any analysis. If there is a high proportion of irrelevant and redundant information present or noisy and unreliable data, then knowledge discovery during the training phase may be more difficult. Data preparation and filtering steps can take a considerable amount of processing time. Examples of methods used in data preprocessing include cleaning, instance selection, normalization, one-hot encoding, data transformation, feature extraction and feature selection. == Applications == === Data mining === Data preprocessing allows for the removal of unwanted data with the use of data cleaning, this allows the user to have a dataset to contain more valuable information after the preprocessing stage for data manipulation later in the data mining process. Editing such dataset to either correct data corruption or human error is a crucial step to get accurate quantifiers like true positives, true negatives, false positives and false negatives found in a confusion matrix that are commonly used for a medical diagnosis. Users are able to join data files together and use preprocessing to filter any unnecessary noise from the data which can allow for higher accuracy. Users use Python programming scripts accompanied by the pandas library which gives them the ability to import data from a comma-separated values as a data-frame. The data-frame is then used to manipulate data that can be challenging otherwise to do in Excel. Pandas (software) which is a powerful tool that allows for data analysis and manipulation; which makes data visualizations, statistical operations and much more, a lot easier. Many also use the R programming language to do such tasks as well. The reason why a user transforms existing files into a new one is because of many reasons. Aspects of data preprocessing may include imputing missing values, aggregating numerical quantities and transforming continuous data into categories (data binning). More advanced techniques like principal component analysis and feature selection are working with statistical formulas and are applied to complex datasets which are recorded by GPS trackers and motion capture devices. === Semantic data preprocessing === Semantic data mining is a subset of data mining that specifically seeks to incorporate domain knowledge, such as formal semantics, into the data mining process. Domain knowledge is the knowledge of the environment the data was processed in. Domain knowledge can have a positive influence on many aspects of data mining, such as filtering out redundant or inconsistent data during the preprocessing phase. Domain knowledge also works as constraint. It does this by using working as set of prior knowledge to reduce the space required for searching and acting as a guide to the data. Simply put, semantic preprocessing seeks to filter data using the original environment of said data more correctly and efficiently. There are increasingly complex problems which are asking to be solved by more elaborate techniques to better analyze existing information. Instead of creating a simple script for aggregating different numerical values into a single value, it make sense to focus on semantic based data preprocessing. The idea is to build a dedicated ontology, which explains on a higher level what the problem is about. In regards to semantic data mining and semantic pre-processing, ontologies are a way to conceptualize and formally define semantic knowledge and data. The Protégé (software) is the standard tool for constructing an ontology. In general, the use of ontologies bridges the gaps between data, applications, algorithms, and results that occur from semantic mismatches. As a result, semantic data mining combined with ontology has many applications where semantic ambiguity can impact the usefulness and efficiency of data systems. Applications include the medical field, language processing, banking, and even tutoring, among many more. There are various strengths to using a semantic data mining and ontological based approach. As previously mentioned, these tools can help during the per-processing phase by filtering out non-desirable data from the data set. Additionally, well-structured formal semantics integrated into well designed ontologies can return powerful data that can be easily read and processed by machines. A specifically useful example of this exists in the medical use of semantic data processing. As an example, a patient is having a medical emergency and is being rushed to hospital. The emergency responders are trying to figure out the best medicine to administer to help the patient. Under normal data processing, scouring all the patient’s medical data to ensure they are getting the best treatment could take too long and risk the patients’ health or even life. However, using semantically processed ontologies, the first responders could save the patient’s life. Tools like a semantic reasoner can use ontology to infer the what best medicine to administer to the patient is based on their medical history, such as if they have a certain cancer or other conditions, simply by examining the natural language used in the patient's medical records. This would allow the first responders to quickly and efficiently search for medicine without having worry about the patient’s medical history themselves, as the semantic reasoner would already have analyzed this data and found solutions. In general, this illustrates the incredible strength of using semantic data mining and ontologies. They allow for quicker and more efficient data extraction on the user side, as the user has fewer variables to account for, since the semantically pre-processed data and ontology built for the data have already accounted for many of these variables. However, there are some drawbacks to this approach. Namely, it requires a high amount of computational power and complexity, even with relatively small data sets. This could result in higher costs and increased difficulties in building and maintaining semantic data processing systems. This can be mitigated somewhat if the data set is already well organized and formatted, but even then, the complexity is still higher when compared to standard data processing. Below is a simple a diagram combining some of the processes, in particular semantic data mining and their use in ontology. The diagram depicts a data set being broken up into two parts: the characteristics of its domain, or domain knowledge, and then the actual acquired data. The domain characteristics are then processed to become user understood domain knowledge that can be applied to the data. Meanwhile, the data set is processed and stored so that the domain knowledge can applied to it, so that the process may continue. This application forms the ontology. From there, the ontology can be used to analyze data and process results. Fuzzy preprocessing is another, more advanced technique for solving complex problems. Fuzzy preprocessing and fuzzy data mining make use of fuzzy sets. These data sets are composed of two elements: a set and a membership function for the set which comprises 0 and 1. Fuzzy preprocessing uses this fuzzy data set to ground numerical values with linguistic information. Raw data is then transformed into natural language. Ultimately, fuzzy data mining's goal is to help deal with inexact information, such as an incomplete database. Currently fuzzy preprocessing, as well as other fuzzy based data mining techniques see frequent use with neural networks and artificial intelligence.

    Read more →
  • PCVC Speech Dataset

    PCVC Speech Dataset

    The PCVC (Persian Consonant Vowel Combination) Speech Dataset is a Modern Persian speech corpus for speech recognition and also speaker recognition. The dataset contains sound samples of Modern Persian combination of vowel and consonant phonemes from different speakers. Every sound sample contains just one consonant and one vowel So it is somehow labeled in phoneme level. This dataset consists of 23 Persian consonants and 6 vowels. The sound samples are all possible combinations of vowels and consonants (138 samples for each speaker). The sample rate of all speech samples is 48000 which means there are 48000 sound samples in every 1 second. Every sound sample starts with consonant then continues with vowel. In each sample, in average, 0.5 second of each sample is speech and the rest is silence. Each sound sample ends with silence. All of sound samples are denoised with "Adaptive noise reduction" algorithm. Compared to Farsdat speech dataset and Persian speech corpus it is more easy to use because it is prepared in .mat data files. Also it is more based on phoneme based separation and all samples are denoised. == Contents == The corpus is downloadable from its Kaggle web page, and contains the following: .mat data files of sound samples in a 23630000 matrix, in which 23 is number of consonants, 6 is the number of vowels and 30000 is the length of sound sample.

    Read more →
  • Multi-focus image fusion

    Multi-focus image fusion

    Multi-focus image fusion is a multiple image compression technique using input images with different focus depths to make one output image that preserves all information. == Overview == The main idea of image fusion is gathering important and the essential information from the input images into one single image which ideally has all of the information of the input images. The research history of image fusion spans over 30 years and many scientific papers. Image fusion generally has two aspects: image fusion methods and objective evaluation metrics. In visual sensor networks (VSN), sensors are cameras which record images and video sequences. In many applications of VSN, a camera can't give a perfect illustration including all details of the scene. This is because of the limited depth of focus of the optical lens of cameras. Therefore, just the object located in the focal length of camera is focused and clear, and other parts of the image are blurred. VSN captures images with different depths of focus using several cameras. Due to the large amount of data generated by cameras compared to other sensors such as pressure and temperature sensors and some limitations of bandwidth, energy consumption and processing time, it is essential to process the local input images to decrease the amount of transmitted data. == Multi-Focus image fusion in the spatial domain == Huang and Jing have reviewed and applied several focus measurements in the spatial domain for the multi-focus image fusion process, suitable for real-time applications. They mentioned some focus measurements including variance, energy of image gradient (EOG), Tenenbaum's algorithm (Tenengrad), energy of Laplacian (EOL), sum-modified-Laplacian (SML), and spatial frequency (SF). Their experiments showed that EOL gave better results than other methods like variance and spatial frequency. == Multi-Focus image fusion in multi-scale transform and DCT domain == Image fusion based on the multi-scale transform is the most commonly used and promising technique. Laplacian pyramid transform, gradient pyramid-based transform, morphological pyramid transform and the premier ones, discrete wavelet transform, shift-invariant wavelet transform (SIDWT), and discrete cosine harmonic wavelet transform (DCHWT) are some examples of image fusion methods based on multi-scale transform. These methods are complex and have some limitations e.g. processing time and energy consumption. For example, multi-focus image fusion methods based on DWT require a lot of convolution operations, so they take more time and energy to process. Therefore, most methods in multi-scale transform are not suitable for real-time applications. Moreover, these methods are not very successful along edges, due to the wavelet transform process missing the edges of the image. They create ringing artefacts in the output image and reduce its quality. Due to the aforementioned problems in the multi-scale transform methods, researchers are interested in multi-focus image fusion in the DCT domain. DCT-based methods are more efficient in terms of transmission and archiving images coded in Joint Photographic Experts Group (JPEG) standard to the upper node in the VSN agent. A JPEG system consists of a pair of an encoder and a decoder. In the encoder, images are divided into non-overlapping 8×8 blocks, and the DCT coefficients are calculated for each. Since the quantization of DCT coefficients is a lossy process, many of the small-valued DCT coefficients are quantized to zero, which corresponds to high frequencies. DCT-based image fusion algorithms work better when the multi-focus image fusion methods are applied in the compressed domain. In addition, in the spatial-based methods, the input images must be decoded and then transferred to the spatial domain. After implementation of the image fusion operations, the output fused images must again be encoded. DCT domain-based methods do not require complex and time-consuming consecutive decoding and encoding operations. Therefore, the image fusion methods based on DCT domain operate with much less energy and processing time. Recently, a lot of research has been carried out in the DCT domain. DCT+Variance, DCT+Corr_Eng, DCT+EOL, and DCT+VOL are some prominent examples of DCT based methods.

    Read more →
  • The Triple Revolution

    The Triple Revolution

    "The Triple Revolution" was an open memorandum sent to U.S. President Lyndon B. Johnson and other government figures on March 22, 1964. It concerned three megatrends of the time: increasing use of automation, the nuclear arms race, and advancements in human rights. Drafted under the auspices of the Center for the Study of Democratic Institutions, it was signed by an array of noted social activists, professors, and technologists who identified themselves as the Ad Hoc Committee on the Triple Revolution. The chief initiator of the proposal was W. H. "Ping" Ferry, at that time a vice-president of CSDI, basing it in large part on the ideas of the futurist Robert Theobald. == Overview == The statement identified three revolutions underway in the world: the cybernation revolution of increasing automation; the weaponry revolution of mutually assured destruction; and the human rights revolution. It discussed primarily the cybernation revolution. The committee claimed that machines would usher in "a system of almost unlimited productive capacity" while continually reducing the number of manual laborers needed, and increasing the skill needed to work, thereby producing increasing levels of unemployment. It proposed that the government should ease this transformation through large-scale public works, low-cost housing, public transit, electrical power development, income redistribution, union representation for the unemployed, and government restraint on technology deployment. == Legacy == Martin Luther King Jr.'s final Sunday sermon, delivered six days before his April 1968 assassination, explicitly references the thesis of "The Triple Revolution": There can be no gainsaying of the fact that a great revolution is taking place in the world today. In a sense it is a triple revolution: that is, a technological revolution, with the impact of automation and cybernation; then there is a revolution in weaponry, with the emergence of atomic and nuclear weapons of warfare; then there is a human rights revolution, with the freedom explosion that is taking place all over the world. Yes, we do live in a period where changes are taking place. And there is still the voice crying through the vista of time saying, "Behold, I make all things new; former things are passed away." In Harlan Ellison's 1967 anthology Dangerous Visions, Philip José Farmer's story "Riders of the Purple Wage" uses the Triple Revolution document as the premise of a future society, in which the "purple wage" of the title is a guaranteed income dole on which most of the population lives. At the 1968 World Science Fiction Convention in San Francisco, Farmer delivered a lengthy Guest of Honor speech in which he called for the founding of a grassroots activist organization called REAP which would work for implementation of the Ad Hoc Committee's recommendations. Looking back on the proposal in his 2008 book, Daniel Bell wrote: "the cybernetic revolution quickly proved to be illusory. There were no spectacular jumps in productivity. ... Cybernation had proved to be one more instance of the penchant for overdramatizing a momentary innovation and blowing it up far out of proportion to its actuality. ... The image of a completely automated production economy—with an endless capacity to turn out goods—was simply a social-science fiction of the early 1960s. Paradoxically, the vision of Utopia was suddenly replaced by the spectre of Doomsday. In place of the early-sixties theme of endless plenty, the picture by the end of the decade was one of a fragile planet of limited resources whose finite stocks were being rapidly depleted, and whose wastes from soaring industrial production were polluting the air and waters." In his 2015 book Rise of the Robots, Martin Ford claims The Triple Revolution's predictions of steady decline in future employment were not wrong, but rather premature. He cites "Seven Deadly Trends" that began in the 1970s-1980s and by the mid-2010s appeared set to continue: Stagnation in real wages Decline in labor's share of national income in many countries (breakdown of Bowley's law), while corporate profits increased Declining labor force participation Diminishing job creation, lengthening jobless recoveries, and soaring long-term unemployment Rising inequality Declining incomes, and underemployment for recent college graduates Polarization and part-time jobs (middle-class jobs are disappearing, to be replaced by a small number of high-paying jobs and large number of low-paying jobs) According to Ford, the 1960s were part of what in retrospect seems like a golden age for labor in the United States, when productivity and wages rose together in near lockstep, and unemployment was low. But after about 1980, wages began stagnating while productivity continued to rise. Labor's share of the economic output began to decline. Ford describes the role that automation and information technology play in these trends, and how new technologies including narrow AI threaten to destroy jobs faster than displaced workers can be retrained for new jobs, before automation takes the new jobs as well. This includes many job categories, such as in transportation, that were never threatened by automation before. According to a 2013 study, about 47% of US jobs are susceptible to automation. == Signatories ==

    Read more →
  • Period-tracking app

    Period-tracking app

    Period-tracking apps are mobile applications used to track the menstrual cycle. They may be used to predict menstruation, to plan fertility, and to track health. Examples include Clue, Glow, and Flo. == Function == Users enter their dates of menstruation, and frequently other experiences such as vaginal discharge and spotting; premenstrual syndrome; changes in mood; menstrual cramps and other pain; and other symptoms such as appetite changes, bloating, and acne. The apps predict the date of users' next period, and often also their ovulation and fertile window. Some apps have additional features such as contraceptive reminders, educational content, tracking modes for use during pregnancy, or the ability to share one's menstrual cycle data with a partner. == Privacy == Period-tracking apps collect personal health data, potentially raising concerns about privacy. Researchers have warned that data may be transferred to third parties and used for consumer profiling and targeted advertising, used for employment and health insurance discrimination, or used to prosecute users for seeking abortions. After the 2022 decision by the United States Supreme Court to overturn Roe v. Wade, and the bans and restrictions on abortion in many US states that followed, many American women uninstalled the apps amidst fear that the data could be accessed by law enforcement and used to prosecute users. WIRED published a ranking of several period-tracking apps by data privacy.

    Read more →
  • Time-compressed speech

    Time-compressed speech

    Time-compressed speech refers to an audio recording of verbal text in which the text is presented in a much shorter time interval than it would through normally-paced real time speech. The basic purpose is to make recorded speech contain more words in a given time, yet still be understandable. For example: a paragraph that might normally be expected to take 20 seconds to read, might instead be presented in 15 seconds, which would represent a time-compression of 25% (5 seconds out of 20). The term "time-compressed speech" should not be confused with "speech compression", which controls the volume range of a sound, but does not alter its time envelope. == Methods == While some voice talents are capable of speaking at rates significantly in excess of general norms, the term "time-compressed speech" most usually refers to examples in which the time-reduction has been accomplished through some form of electronic processing of the recorded speech. In general, recorded speech can be electronically time-compressed by: increasing its speed (linear compression); removing silences (selective editing); a combination of the two (non-linear compression). The speed of a recording can be increased, which will cause the material to be presented at a faster rate (and hence in a shorter amount of time), but this has the undesirable side-effect of increasing the frequency of the whole passage, raising the pitch of the voices, which can reduce intelligibility. There are normally silences between words and sentences, and even small silences within certain words, both of which can be reduced or removed ("edited-out") which will also reduce the amount of time occupied by the full speech recording. However, this can also have the effect of removing verbal "punctuation" from the speech, causing words and sentences to run together unnaturally, again reducing intelligibility. Vowels are typically held a minimum of 20 milliseconds, over many cycles of the fundamental pitch. DSP systems can detect the beginning and end of each cycle and then skip over some fraction of those cycles, causing the material to be presented at a faster rate, without changing the pitch, maintaining a "normal" tone of voice. The current preferred method of time-compression is called "non-linear compression", which employs a combination of selectively removing silences; speeding up the speech to make the reduced silences sound normally-proportioned to the text; and finally applying various data algorithms to bring the speech back down to the proper pitch. This produces a more acceptable result than either of the two earlier techniques; however, if unrestrained, removing the silences and increasing the speed can make a selection of speech sound more insistent, possibly to the point of unpleasantness. == Applications == === Advertising === Time-compressed speech is frequently used in television and radio advertising. The advantage of time-compressed speech is that the same number of words can be compressed into a smaller amount of time, reducing advertising costs, and/or allowing more information to be included in a given radio or TV advertisement. It is usually most noticeable in the information-dense caveats and disclaimers presented (usually by legal requirement) at the end of commercials—the aural equivalent of the "fine print" in a printed contract. This practice, however, is not new: before electronic methods were developed, spokespeople who could talk extremely quickly and still be understood were widely used as voice talents for radio and TV advertisements, and especially for recording such disclaimers. === Education === Time-compressed speech has educational applications such as increasing the information density of trainings, and as a study aid. A number of studies have demonstrated that the average person is capable of relatively easily comprehending speech delivered at higher-than-normal rates, with the peak occurring at around 25% compression (that is, 25% faster than normal); this facility has been demonstrated in several languages. Conversational speech (in English) takes place at a rate of around 150 wpm (words per minute), but the average person is able to comprehend speech presented at rates of up to 200-250 wpm without undue difficulty. Blind and severely visually impaired subjects scored similar comprehension levels at even higher rates, up to 300-350 wpm. Blind people have been found to use time-compressed speech extensively, for example, when reviewing recorded lectures from high school and college classes, or professional trainings. Comprehension rates in older blind subjects have been found to be as good, or in some cases better than those found in younger sighted subjects. Other studies have determined that the ability to comprehend highly time-compressed speech tends to fall off with increased age, and is also reduced when the language of the time-compressed speech is not the listener's native language. Non-native speakers can, however, improve their comprehension level of time-compressed speech with multiday training. === Voice Mail === Voice mail systems have employed time-compressed speech since as far back as the 1970s. In this application, the technology enables the rapid review of messages in high-traffic systems, by a relatively small number of people. === Streaming Multimedia === Time-compressed speech has been explored as one of a variety of interrelated factors which may be manipulated to increase the efficiency of streaming multimedia presentations, by significantly reducing the latency times involved in the transfer of large digitally encoded media files.

    Read more →
  • Rapid prototyping

    Rapid prototyping

    Rapid prototyping is a group of techniques used to quickly fabricate a scale model of a physical part or assembly using three-dimensional computer aided design (CAD) data. Construction of the part or assembly is usually done using 3D printing or "additive layer manufacturing" technology. The first methods for rapid prototyping became available in mid 1987 and were used to produce models and prototype parts. Today, they are used for a wide range of applications and are used to manufacture production-quality parts in relatively small numbers if desired without the typical unfavorable short-run economics. This economy has encouraged online service bureaus. Historical surveys of RP technology start with discussions of simulacra production techniques used by 19th-century sculptors. Some modern sculptors use the progeny technology to produce exhibitions and various objects. The ability to reproduce designs from a dataset has given rise to issues of rights, as it is now possible to interpolate volumetric data from 2D images. As with CNC subtractive methods, the computer-aided-design – computer-aided manufacturing CAD -CAM workflow in the traditional rapid prototyping process starts with the creation of geometric data, either as a 3D solid using a CAD workstation, or 2D slices using a scanning device. For rapid prototyping this data must represent a valid geometric model; namely, one whose boundary surfaces enclose a finite volume, contain no holes exposing the interior, and do not fold back on themselves. In other words, the object must have an "inside". The model is valid if for each point in 3D space the computer can determine uniquely whether that point lies inside, on, or outside the boundary surface of the model. CAD post-processors will approximate the application vendors' internal CAD geometric forms (e.g., B-splines) with a simplified mathematical form, which in turn is expressed in a specified data format which is a common feature in additive manufacturing: STL file format, a de facto standard for transferring solid geometric models to SFF machines. To obtain the necessary motion control trajectories to drive the actual SFF, rapid prototyping, 3D printing or additive manufacturing mechanism, the prepared geometric model is typically sliced into layers, and the slices are scanned into lines (producing a "2D drawing" used to generate trajectory as in CNC's toolpath), mimicking in reverse the layer-to-layer physical building process. == Application areas == Rapid prototyping is also commonly applied in software engineering to try out new business models and application architectures such as Aerospace, Automotive, Financial Services, Product development, and Healthcare. Aerospace design and industrial teams rely on prototyping in order to create new AM methodologies in the industry. Using SLA they can quickly make multiple versions of their projects in a few days and begin testing quicker. Rapid Prototyping allows designers/developers to provide an accurate idea of how the finished product will turn out before putting too much time and money into the prototype. 3D printing being used for Rapid Prototyping allows for Industrial 3D printing to take place. With this, you could have large-scale moulds to spare parts being pumped out quickly within a short period of time. == Types of Rapid Prototyping == Stereolithography (SLA) → a laser-cured photopolymer for materials such as thermoplastic-like photopolymers. Selective Laser Sintering (SLS) → a laser-sintered powder for materials such as Nylon or TPU. Direct Metal Laser Sintering (DMLS) → laser-sintered metal powder for materials like stainless steel, titanium, chrome, and aluminum. Fused Deposition Modeling (FDM) → fused extrusions of filaments like ABS, PC, and PPCU. Multi Jet Fusion (MJF) → it is an inkjet array selective fusing across bed of nylon powder for Black Nylon 12. PolyJet (PJET) → it is a uv-cured jetted photopolymer to work with acrylic-based and elastomeric photopolymers. Computer Numerical Controlled Machine (CNC) → it is used for manipulating engineering-grade thermoplastics and metals. Injection Molding (IM) → the injection is done using aluminum molds and it is used for thermoplastics, metals and liquid silicone rubber. Vacuum Casting→ is a manufacturing process used to create high-quality prototypes and small batches of parts. == History == In the 1970s, Joseph Henry Condon and others at Bell Labs developed the Unix Circuit Design System (UCDS), automating the laborious and error-prone task of manually converting drawings to fabricate circuit boards for the purposes of research and development. By the 1980s, U.S. policy makers and industrial managers were forced to take note that America's dominance in the field of machine tool manufacturing evaporated, in what was named the machine tool crisis. Numerous projects sought to counter these trends in the traditional CNC CAM area, which had begun in the US. Later when Rapid Prototyping Systems moved out of labs to be commercialized, it was recognized that developments were already international and U.S. rapid prototyping companies would not have the luxury of letting a lead slip away. The National Science Foundation was an umbrella for the National Aeronautics and Space Administration (NASA), the US Department of Energy, the US Department of Commerce NIST, the US Department of Defense, Defense Advanced Research Projects Agency (DARPA), and the Office of Naval Research coordinated studies to inform strategic planners in their deliberations. One such report was the 1997 Rapid Prototyping in Europe and Japan Panel Report in which Joseph J. Beaman founder of DTM Corporation [DTM RapidTool pictured] provides a historical perspective: The roots of rapid prototyping technology can be traced to practices in topography and photosculpture. Within TOPOGRAPHY Blanther (1892) suggested a layered method for making a mold for raised relief paper topographical maps .The process involved cutting the contour lines on a series of plates which were then stacked. Matsubara (1974) of Mitsubishi proposed a topographical process with a photo-hardening photopolymer resin to form thin layers stacked to make a casting mold. PHOTOSCULPTURE was a 19th-century technique to create exact three-dimensional replicas of objects. Most famously Francois Willeme (1860) placed 24 cameras in a circular array and simultaneously photographed an object. The silhouette of each photograph was then used to carve a replica. Morioka (1935, 1944) developed a hybrid photo sculpture and topographic process using structured light to photographically create contour lines of an object. The lines could then be developed into sheets and cut and stacked, or projected onto stock material for carving. The Munz (1956) Process reproduced a three-dimensional image of an object by selectively exposing, layer by layer, a photo emulsion on a lowering piston. After fixing, a solid transparent cylinder contains an image of the object. "The Origins of Rapid Prototyping - RP stems from the ever-growing CAD industry, more specifically, the solid modeling side of CAD. Before solid modeling was introduced in the late 1980's, three-dimensional models were created with wire frames and surfaces. But not until the development of true solid modeling could innovative processes such as RP be developed. Charles Hull, who helped found 3D Systems in 1986, developed the first RP process. This process, called stereolithography, builds objects by curing thin consecutive layers of certain ultraviolet light-sensitive liquid resins with a low-power laser. With the introduction of RP, CAD solid models could suddenly come to life". The technologies referred to as Solid Freeform Fabrication are what we recognize today as rapid prototyping, 3D printing or additive manufacturing: Swainson (1977), Schwerzel (1984) worked on polymerization of a photosensitive polymer at the intersection of two computer controlled laser beams. Ciraud (1972) considered magnetostatic or electrostatic deposition with electron beam, laser or plasma for sintered surface cladding. These were all proposed but it is unknown if working machines were built. Hideo Kodama of Nagoya Municipal Industrial Research Institute was the first to publish an account of a solid model fabricated using a photopolymer rapid prototyping system (1981). The first 3D rapid prototyping system relying on Fused Deposition Modeling (FDM) was made in April 1992 by Stratasys but the patent did not issue until June 9, 1992. Sanders Prototype, Inc introduced the first desktop inkjet 3D Printer (3DP) using an invention from August 4, 1992 (Helinski), Modelmaker 6Pro in late 1993 and then the larger industrial 3D printer, Modelmaker 2, in 1997. Z-Corp using the MIT 3DP powder binding for Direct Shell Casting (DSP) invented 1993 was introduced to the market in 1995. Even at that early date the technology was seen as having a place in manufacturing practice. A low resol

    Read more →