Concept mining

Concept mining

Concept mining is an activity that results in the extraction of concepts from artifacts. Solutions to the task typically involve aspects of artificial intelligence and statistics, such as data mining and text mining. Because artifacts are typically a loosely structured sequence of words and other symbols (rather than concepts), the problem is nontrivial, but it can provide powerful insights into the meaning, provenance and similarity of documents. == Methods == Traditionally, the conversion of words to concepts has been performed using a thesaurus, and for computational techniques the tendency is to do the same. The thesauri used are either specially created for the task, or a pre-existing language model, usually related to Princeton's WordNet. The mappings of words to concepts are often ambiguous. Typically each word in a given language will relate to several possible concepts. Humans use context to disambiguate the various meanings of a given piece of text, where available machine translation systems cannot easily infer context. For the purposes of concept mining, however, these ambiguities tend to be less important than they are with machine translation, for in large documents the ambiguities tend to even out, much as is the case with text mining. There are many techniques for disambiguation that may be used. Examples are linguistic analysis of the text and the use of word and concept association frequency information that may be inferred from large text corpora. Recently, techniques that base on semantic similarity between the possible concepts and the context have appeared and gained interest in the scientific community. == Applications == === Detecting and indexing similar documents in large corpora === One of the spin-offs of calculating document statistics in the concept domain, rather than the word domain, is that concepts form natural tree structures based on hypernymy and meronymy. These structures can be used to generate simple tree membership statistics, that can be used to locate any document in a Euclidean concept space. If the size of a document is also considered as another dimension of this space then an extremely efficient indexing system can be created. This technique is currently in commercial use locating similar legal documents in a 2.5 million document corpus. === Clustering documents by topic === Standard numeric clustering techniques may be used in "concept space" as described above to locate and index documents by the inferred topic. These are numerically far more efficient than their text mining cousins, and tend to behave more intuitively, in that they map better to the similarity measures a human would generate.

IPUMS

IPUMS, originally the Integrated Public Use Microdata Series, is the world's largest individual-level population database. IPUMS consists of microdata samples from United States (IPUMS-USA) and international (IPUMS-International) census records, as well as data from U.S. and international surveys. The records are converted into a consistent format and made available to researchers through a web-based data dissemination and analysis system. IPUMS is housed at the Institute for Social Research and Data Innovation (ISRDI), an interdisciplinary research center at the University of Minnesota, under the direction of Professor Steven Ruggles. == Description == IPUMS includes all persons enumerated in the United States censuses from 1850 to 1950 (though, the 1890 census is missing because it was destroyed in a fire) and from the American Community Survey since 2000 and the Current Population Survey since 1962. IPUMS includes household-level data for United States Censuses from 1790 to 1840, due to the first six censuses only including the name of the head of household, with tallied household totals following. IPUMS provides consistent variable names, coding schemes, and documentation across all the samples, facilitating the analysis of long-term change. IPUMS-International includes countries from Africa, Asia, Europe, and Latin America for 1960 forward. The database currently includes more than a billion individuals enumerated in 365 censuses from 94 countries around the world. IPUMS-International converts census microdata for multiple countries into a consistent format, allowing for comparisons across countries and time periods. Special efforts are made to simplify use of the data while losing no meaningful information. Comprehensive documentation is provided in a coherent form to facilitate comparative analyses of social and economic change. Additional databases in the IPUMS family include the: North Atlantic Population Project (NAPP) IPUMS National Historical Geographic Information System (NHGIS) IPUMS Health Surveys IPUMS Global Health IPUMS Time Use The Journal of American History described the effort as "One of the great archival projects of the past two decades." Liens Socio, the French portal for the social sciences, gave IPUMS the only “best site” designation that has gone to any non-French website, writing “IPUMS est un projet absolument extraordinaire...époustouflante [mind-blowing]!” The official motto of IPUMS is "use it for good, never for evil." All public IPUMS data and documentation are available online free of charge.

Moral Machine

Moral Machine is an online platform, developed by Iyad Rahwan's Scalable Cooperation group at the Massachusetts Institute of Technology, that generates moral dilemmas and collects information on the decisions that people make between two destructive outcomes. The platform is the idea of Iyad Rahwan and social psychologists Azim Shariff and Jean-François Bonnefon, who conceived of the idea ahead of the publication of their article about the ethics of self-driving cars. The key contributors to building the platform were MIT Media Lab graduate students Edmond Awad and Sohan Dsouza. The presented scenarios are often variations of the trolley problem, and the information collected would be used for further research regarding the decisions that machine intelligence must make in the future. For example, as artificial intelligence plays an increasingly significant role in autonomous driving technology, research projects like Moral Machine help to find solutions for challenging life-and-death decisions that will face self-driving vehicles. Moral Machine was active from January 2016 to July 2020. The Moral Machine continues to be available on their website for people to experience. == The experiment == The Moral Machine was an ambitious project; it was the first attempt at using such an experimental design to test a large number of humans in over 200 countries worldwide. The study was approved by the Institute Review Board (IRB) at Massachusetts Institute of Technology (MIT). The setup of the experiment asks the viewer to make a decision on a single scenario in which a self-driving car is about to hit pedestrians. The user can decide to have the car either swerve to avoid hitting the pedestrians or keep going straight to preserve the lives it is transporting. Participants can complete as many scenarios as they want to, however the scenarios themselves are generated in groups of thirteen. Within this thirteen, a single scenario is entirely random while the other twelve are generated from a space in a database of 26 million different possibilities. They are chosen with two dilemmas focused on each of six dimensions of moral preferences: character gender, character age, character physical fitness, character social status, character species, and character number. The experiment setup remains the same throughout multiple scenarios but each scenario tests a different set of factors. Most notably, the characters involved in the scenario are different in each one. Characters may include ones such as: Stroller, girl, boy, pregnant, Male Doctor, Female Doctor, Female Athlete, Executive Female, Male Athlete, Executive Male, Large Woman, Large Man, homeless, old man, old woman, dog, criminal, and a cat. Through these different characters researchers were able to understand how a wide variety of people will judge scenarios based on those involved. == Analysis == The Moral Machine collected 40 million moral decisions from 4 million participants in 233 countries, analysis of which revealed trends within individual countries and humanity as a whole. It tested for nine factors: preference for sparing humans versus pets, passengers versus pedestrians, men versus women, young versus elderly, fit versus overweight, higher versus lower social status, jaywalkers versus law abiders, larger versus smaller groups, and inaction (i.e. staying on course) versus swerving. Globally, participants favored human lives over lives of animals like dogs and cats. They preferred to spare more lives if possible, and younger lives as opposed to older. Babies were most often spared with cats being the least spared. In terms of gender variations, people tended to spare men over women for doctors and the elderly. All countries generally shared the preference to spare pedestrians over passengers and law-abiders over criminals. Participants from less wealthy countries showed a higher tendency of sparing pedestrians who crossed illegally compared to those from more wealthy and developed countries. This is most likely due to their experience living in a society where individuals are more likely to deviate from rules due to less stringent enforcement of laws. Countries of higher economic inequality overwhelmingly prefer to save wealthier individuals over poorer ones. === Cultural differences === Researchers subdivided 130 countries with similar results into three ‘cultural clusters’. North America and European countries with significant Christian populations had a higher preference for inaction on the part of the driver and thus had less of a preference for sparing pedestrians as compared to other clusters. East Asian and Islamic countries, together constituting the second cluster, did not have as much preference to spare younger humans compared to the other two clusters and had a higher preference for sparing law-abiding humans. Latin America and Francophone countries had a higher preference for sparing women, the young, the fit, and those of higher status, but a lower preference for sparing humans over pets or other animals. Individualistic cultures tended to spare larger groups, and collectivist cultures had a stronger preference for sparing the lives of older people. For instance, China ranked far below the world average for preference to spare the younger over elderly, while the average respondent from the US exhibited a much higher tendency to save younger lives and larger groups. == Applications of the data == The findings from the moral machine can help decision makers when designing self-driving automotive systems. Designers must make sure that these vehicles are able to solve problems on the road that aligns with the moral values of humans around it. This is a challenge because of the complex nature of humans who may all make different decisions based on their personal values. However, by collecting a large amount of decisions from humans all over the world, researchers can begin to understand patterns in the context of a particular culture, community, and people. == Other features == The Moral Machine was deployed in June 2016. In October 2016, a feature was added that offered users the option to fill a survey about their demographics, political views, and religious beliefs. Between November 2016 and March 2017, the website was progressively translated into nine languages in addition to English (Arabic, Chinese, French, German, Japanese, Korean, Portuguese, Russian, and Spanish). Overall, the Moral Machine offers four different modes, with the focus being on the data-gathering feature of the website, called the Judge mode. This means that the Moral Machine, in addition to providing their own scenarios for users to judge, also invites users to create their own scenarios to be submitted and approved so that other people may also judge those scenarios. Data is also open sourced for anyone to explore via an interactive map that is featured on the Moral Machine website. == In the literature == Studies and research on the Moral Machine have taken a wide variety of approaches. However, theological examinations of the topic are still scarce where two bodies of work that examine such perspective currently exist in this regard: One is Buddhist while the other is Christian.

ChipTest

ChipTest was a 1985 chess playing computer built by Feng-hsiung Hsu, Thomas Anantharaman and Murray Campbell at Carnegie Mellon University. It is the predecessor of Deep Thought which in turn evolved into Deep Blue. == History == ChipTest was based on a special VLSI-technology move generator chip developed by Hsu. ChipTest was controlled by a Sun-3/160 workstation and capable of searching approximately 50,000 moves per second. Hsu and Anantharaman entered ChipTest in the 1986 North American Computer Chess Championship, and it was only partially tested when the tournament began. It lost its first two rounds, but finished with an even score. In August 1987, ChipTest was overhauled and renamed ChipTest-M, M standing for microcode. The new version had eliminated ChipTest's bugs and was ten times faster, searching 500,000 moves per second and running on a Sun-4 workstation. ChipTest-M won the North American Computer Chess Championship in 1987 with a 4–0 sweep. ChipTest was invited to play in the 1987 American Open, but the team did not enter due to an objection by the HiTech team, also from Carnegie Mellon University. HiTech and ChipTest shared some code, and Hitech was already playing in the tournament. The two teams became rivals. Designing and implementing ChipTest revealed many possibilities for improvement, so the designers started on a new machine. Deep Thought 0.01 was created in May 1988 and the version 0.02 in November the same year. This new version had two customized VLSI chess processors and it was able to search 720,000 moves per second. With the "0.02" dropped from its name, Deep Thought won the World Computer Chess Championship with a perfect 5–0 score in 1989.

LipNet

LipNet is a deep neural network for audio-visual speech recognition (ASVR). It was created by University of Oxford researchers Yannis Assael, Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. The researchers stated that could match mouth movements to text with 93 percent accuracy, though it was criticized for its test using a limited dataset of words and grammar. It was used in Nvidia's autonomous "backseat driver" prototype Co-Pilot.

Time-compressed speech

Time-compressed speech refers to an audio recording of verbal text in which the text is presented in a much shorter time interval than it would through normally-paced real time speech. The basic purpose is to make recorded speech contain more words in a given time, yet still be understandable. For example: a paragraph that might normally be expected to take 20 seconds to read, might instead be presented in 15 seconds, which would represent a time-compression of 25% (5 seconds out of 20). The term "time-compressed speech" should not be confused with "speech compression", which controls the volume range of a sound, but does not alter its time envelope. == Methods == While some voice talents are capable of speaking at rates significantly in excess of general norms, the term "time-compressed speech" most usually refers to examples in which the time-reduction has been accomplished through some form of electronic processing of the recorded speech. In general, recorded speech can be electronically time-compressed by: increasing its speed (linear compression); removing silences (selective editing); a combination of the two (non-linear compression). The speed of a recording can be increased, which will cause the material to be presented at a faster rate (and hence in a shorter amount of time), but this has the undesirable side-effect of increasing the frequency of the whole passage, raising the pitch of the voices, which can reduce intelligibility. There are normally silences between words and sentences, and even small silences within certain words, both of which can be reduced or removed ("edited-out") which will also reduce the amount of time occupied by the full speech recording. However, this can also have the effect of removing verbal "punctuation" from the speech, causing words and sentences to run together unnaturally, again reducing intelligibility. Vowels are typically held a minimum of 20 milliseconds, over many cycles of the fundamental pitch. DSP systems can detect the beginning and end of each cycle and then skip over some fraction of those cycles, causing the material to be presented at a faster rate, without changing the pitch, maintaining a "normal" tone of voice. The current preferred method of time-compression is called "non-linear compression", which employs a combination of selectively removing silences; speeding up the speech to make the reduced silences sound normally-proportioned to the text; and finally applying various data algorithms to bring the speech back down to the proper pitch. This produces a more acceptable result than either of the two earlier techniques; however, if unrestrained, removing the silences and increasing the speed can make a selection of speech sound more insistent, possibly to the point of unpleasantness. == Applications == === Advertising === Time-compressed speech is frequently used in television and radio advertising. The advantage of time-compressed speech is that the same number of words can be compressed into a smaller amount of time, reducing advertising costs, and/or allowing more information to be included in a given radio or TV advertisement. It is usually most noticeable in the information-dense caveats and disclaimers presented (usually by legal requirement) at the end of commercials—the aural equivalent of the "fine print" in a printed contract. This practice, however, is not new: before electronic methods were developed, spokespeople who could talk extremely quickly and still be understood were widely used as voice talents for radio and TV advertisements, and especially for recording such disclaimers. === Education === Time-compressed speech has educational applications such as increasing the information density of trainings, and as a study aid. A number of studies have demonstrated that the average person is capable of relatively easily comprehending speech delivered at higher-than-normal rates, with the peak occurring at around 25% compression (that is, 25% faster than normal); this facility has been demonstrated in several languages. Conversational speech (in English) takes place at a rate of around 150 wpm (words per minute), but the average person is able to comprehend speech presented at rates of up to 200-250 wpm without undue difficulty. Blind and severely visually impaired subjects scored similar comprehension levels at even higher rates, up to 300-350 wpm. Blind people have been found to use time-compressed speech extensively, for example, when reviewing recorded lectures from high school and college classes, or professional trainings. Comprehension rates in older blind subjects have been found to be as good, or in some cases better than those found in younger sighted subjects. Other studies have determined that the ability to comprehend highly time-compressed speech tends to fall off with increased age, and is also reduced when the language of the time-compressed speech is not the listener's native language. Non-native speakers can, however, improve their comprehension level of time-compressed speech with multiday training. === Voice Mail === Voice mail systems have employed time-compressed speech since as far back as the 1970s. In this application, the technology enables the rapid review of messages in high-traffic systems, by a relatively small number of people. === Streaming Multimedia === Time-compressed speech has been explored as one of a variety of interrelated factors which may be manipulated to increase the efficiency of streaming multimedia presentations, by significantly reducing the latency times involved in the transfer of large digitally encoded media files.

AlphaStar (software)

AlphaStar is an artificial intelligence (AI) software developed by DeepMind for playing the video game StarCraft II. It was unveiled to the public by name in January 2019. AlphaStar attained "Grandmaster" status in August 2019, considered a milestone for AI in video games at the time. == Background == Games created for humans are considered to have external validity as benchmarks of progress in artificial intelligence. IBM's chess engine Deep Blue (1997) and DeepMind's AlphaGo (2016) were considered major milestones; some argue that StarCraft would also be a major milestone, due to the game's "real-time play, partial observability, no single dominant strategy, complex rules that make it hard to build a fast forward model, and a particularly large and varied action space." Though difficult, StarCraft may still be tractable with current technology because "its rules are known and the world is discrete with only a few types of objects". StarCraft II is a popular fast-paced online real-time strategy game developed by Blizzard Entertainment. == History == DeepMind Technologies was founded in the UK in 2010. As early as 2011, founder Demis Hassabis called StarCraft "the next step up" after games like Go. DeepMind became a subsidiary of Google in 2014, after demonstrating self-learning bots with superhuman ability at a variety of Atari 2600 games. In February 2015, computer scientist Zachary Mason predicted Deepmind's research "leads to StarCraft in five or ten years". In March 2016, following AlphaGo's victory over Lee Sedol, a world champion Go player, Hassabis publicly mulled building an AI for StarCraft, citing it as a strategic game with incomplete information where, unlike Go, much of the "board" is invisible. A formal collaboration was announced at BlizzCon in November 2016, alongside a plan to release an open development environment for bots in Q1 of 2017. By 2017, DeepMind was experimenting with feeding StarCraft data into its software. In August 2017, DeepMind and Blizzard released development tools to assist in bot development, as well as data from 65,000 historical games. At the time, computer scientist and StarCraft tournament manager David Churchill estimated it would take five years for a bot to beat a human, but made the caveat that AlphaGo had beaten expectations. In Wired, tech journalist Tom Simonite stated "No one expects the robot to win anytime soon. But when it does, it will be a far greater achievement than DeepMind's conquest of Go." In December 2018, DeepMind's bot defeated professional player Grzegorz "MaNa" Komincz, 5-0. DeepMind announced the bot, named "AlphaStar", in January 2019. A journalist at Ars Technica and others argued that AlphaStar still had unfair advantages: "AlphaStar has the ability to make its clicks with surgical precision using an API, whereas human players are constrained by the mechanical limits of computer mice". AlphaStar also had a global view rather than being limited by the in-game camera. Furthermore, while there was a cap on the number of actions over a five-second window, AlphaStar was free to allocate its action quota unevenly across the window in order to launch superhuman bursts of activity at critical moments. DeepMind quickly retrained AlphaStar under more realistic constraints, and then lost a rematch with Komincz. Starting in July 2019, the new, constrained version of AlphaStar anonymously competed against players who "opted in" on the public 1v1 European multiplayer ladder. By the end of August 2019, AlphaStar had attained Grandmaster level, ranking among the top 0.2% of human players. == Algorithms == Unlike AlphaZero, AlphaStar initially learns to imitate the moves of the best players in its database of human vs. human games; this step is necessary to solve what DeepMind's Dave Silver calls "the exploration problem": discovering new strategies would otherwise be like finding a "needle in a haystack". Agents then play each other and deploy deep reinforcement learning. These main agents also learn by playing against suboptimal "exploiter agents" whose purpose is to expose weaknesses in the main agents. == Reactions == After his 5-0 defeat in December 2018, Komincz stated "I wasn't expecting the AI to be that good". Stuart Russell assessed that AlphaStar's 2018 victory required "a fair amount of problem-specific effort" and that general-purpose methods were "not quite ready for StarCraft". An article in Wired UK judged AlphaStar's new constraints, adopted for the July 2019 matches, to be "fair" this time around. StarCraft professional Raza "RazerBlader" Sekha stated AlphaStar was "impressive" but had its quirks, succumbing in one game to an unorthodox army composition made up of only air units. The UK's top player, Joshua "RiSky" Hayward, expressed some disappointment, saying AlphaStar "often didn't make the most efficient, strategic decisions". Professional Diego "Kelazhur" Schwimer called AlphaStar's play "unimaginably unusual; it really makes you question how much of StarCraft's diverse possibilities pro players have really explored". AlphaStar's opponents often did not realize they were playing a bot. Ian Sample, of The Guardian, called AlphaStar a "landmark achievement" for the field of AI. Churchill stated that he had previously seen bots that master one or two elements of StarCraft, but that AlphaStar was the first that can handle the game in its entirety. Gary Marcus expressed his continuing skepticism about deep learning, stating: "So far the field has struggled to take techniques like this out of the laboratory and game environments and into the real world, and I don't immediately see this result as progress in that direction". AI researcher Jon Dodge was surprised by AlphaStar, stating that he did not expect such a "superhuman" performance for "another couple of years"; in contrast, Churchill states "StarCraft is nowhere near being 'solved', and AlphaStar is not yet even close to playing at a world champion level". == Legacy == DeepMind argues that insights from AlphaStar might benefit robots, self-driving cars, and virtual assistants, which need to operate with "imperfectly observed information". Silver has indicated his lab "may rest at this point", rather than try to substantially improve AlphaStar. Silver himself argues that "AlphaStar has become the first AI system to reach the top tier of human performance in any professionally played e-sport on the full unrestricted game under professionally approved conditions... Ever since computers cracked Go, chess, and poker, the game of StarCraft has emerged, essentially by consensus from the community, as the next grand challenge for AI." Computer scientist Noel Sharkey argues, disapprovingly, that "military analysts will certainly be eyeing the successful AlphaStar real-time strategies as a clear example of the advantages of AI for battlefield planning". In contrast, Silver argues: "To say that this has any kind of military use is saying no more than to say an AI for chess could be used to lead to military applications".