Language technology

Language technology, often called human language technology (HLT), studies how computer programs or electronic devices can analyze, produce, modify or respond to human text and speech. Working with language technology often requires broad knowledge of both linguistics and computer science. It consists of natural language processing (NLP) and computational linguistics (CL) on the one hand, together with many application-oriented aspects of these, and more low-level aspects such as encoding and speech technology on the other hand. These elementary aspects are normally not considered to be within the scope of related terms such as natural language processing and (applied) computational linguistics, which are otherwise near-synonyms. For example, for many of the world's lesser-known languages, the foundation of language technology is providing communities with fonts and keyboard setups so that their languages can be written on computers or mobile devices. Modern language technology also includes tools such as machine translation, speech recognition, text processing and natural language processing. Large-scale AI models have recently advanced the field and enhanced the ability of machines to interpret complex human context.

Colloquis

Colloquis, previously known as ActiveBuddy and Conversagent, was a company that created conversation-based interactive agents originally distributed via instant messaging platforms. The company had offices in New York, New York, and Sunnyvale, California.

== History ==
Founded in 2000, the company was the brainchild of Robert Hoffer, Timothy Kay, and Peter Levitan. The idea for interactive agents (also known as Internet bots) came from the team's vision to add functionality to increasingly popular instant messaging services. The original implementation took shape as a word-based adventure game but quickly grew to include a wide range of database applications, including access to news, weather, stock information, movie times, Yellow Pages listings, and detailed sports data, as well as a variety of tools (calculators, translator, etc.). These various applications were bundled into one entity and launched as SmarterChild in 2001. SmarterChild acted as a showcase for the quick data access and possibilities for fun conversation that the company planned to turn into customized, niche-specific products. The rapid success of SmarterChild led to targeted promotional products for Radiohead, Austin Powers, The Sporting News, and others. ActiveBuddy sought to strengthen its hold on the interactive agent market by filing for, and receiving, a controversial patent on its creation in 2002. Also in 2002, the company released the BuddyScript SDK, a free developer kit that allows programmers to design and launch their own interactive agents using ActiveBuddy's proprietary scripting language. Ultimately, however, the decline in ad spending in 2001 and 2002 led to a shift in corporate strategy towards business-focused Automated Service Agents, building products for clients including Cingular, Comcast and Cox Communications. The company subsequently changed its name from ActiveBuddy to Conversagent in 2003, and then again to Colloquis in 2006.
Colloquis was purchased by Microsoft in October 2006.

Randomized benchmarking

Randomized benchmarking is an experimental method for measuring the average error rates of quantum computing hardware platforms. The protocol estimates the average error rates by implementing long sequences of randomly sampled quantum gate operations. Randomized benchmarking is the industry-standard protocol used by quantum hardware developers such as IBM and Google to test the performance of quantum operations. The original theory of randomized benchmarking, proposed by Joseph Emerson and collaborators, considered the implementation of sequences of Haar-random operations, but this had several practical limitations. The now-standard protocol for randomized benchmarking (RB) relies on uniformly random Clifford operations, as proposed in 2006 by Dankert et al. as an application of the theory of unitary t-designs. In current usage, randomized benchmarking sometimes refers to the broader family of generalizations of the original 2005 protocol that use different random gate sets and can identify various features of the strength and type of errors affecting the elementary quantum gate operations. Randomized benchmarking protocols are an important means of verifying and validating quantum operations and are also routinely used for the optimization of quantum control procedures.

== Overview ==
Randomized benchmarking offers several key advantages over alternative approaches to error characterization. For example, the number of experimental procedures required for full characterization of errors (called tomography) grows exponentially with the number of quantum bits (called qubits). This makes tomographic methods impractical for even small systems of just 3 or 4 qubits. In contrast, randomized benchmarking protocols are the only known approaches to error characterization that scale efficiently as the number of qubits in the system increases. Thus RB can be applied in practice to characterize errors in arbitrarily large quantum processors.
Additionally, in experimental quantum computing, procedures for state preparation and measurement (SPAM) are also error-prone, and thus quantum process tomography is unable to distinguish errors associated with gate operations from errors associated with SPAM. In contrast, RB protocols are robust to state-preparation and measurement errors. Randomized benchmarking protocols estimate key features of the errors that affect a set of quantum operations by examining how the observed fidelity of the final quantum state decreases as the length of the random sequence increases. If the set of operations satisfies certain mathematical properties, such as comprising a sequence of twirls with unitary two-designs, then the measured decay can be shown to be an invariant exponential with a rate fixed uniquely by features of the error model.

== History ==
Randomized benchmarking was proposed in "Scalable noise estimation with random unitary operators", where it was shown that long sequences of quantum gates sampled uniformly at random from the Haar measure on the group SU(d) would lead to an exponential decay at a rate that was uniquely fixed by the error model. Emerson, Alicki and Zyczkowski also showed, under the assumption of gate-independent errors, that the measured decay rate is directly related to an important figure of merit, the average gate fidelity, and is independent of the choice of initial state and any errors in the initial state, as well as of the specific random sequences of quantum gates. This protocol applied for arbitrary dimension d and an arbitrary number n of qubits, where d = 2^n. The SU(d) RB protocol had two important limitations that were overcome in a modified protocol proposed by Dankert et al., who proposed sampling the gate operations uniformly at random from any unitary two-design, such as the Clifford group. They proved that this would produce the same exponential decay rate as the random SU(d) version of the protocol proposed in Emerson et al.
This follows from the observation that a random sequence of gates is equivalent to an independent sequence of twirls under that group, as was conjectured and later proven. This Clifford-group approach to randomized benchmarking is now the standard method for assessing error rates in quantum computers. A variation of this protocol was proposed by NIST in 2008 for the first experimental implementation of an RB-type protocol for single-qubit gates. However, the sampling of random gates in the NIST protocol was later proven not to reproduce any unitary two-design. The NIST RB protocol was later shown to also produce an exponential fidelity decay, albeit with a rate that depends on non-invariant features of the error model. In recent years a rigorous theoretical framework has been developed for Clifford-group RB protocols to show that they work reliably under very broad experimental conditions. In 2011 and 2012, Magesan et al. proved that the exponential decay rate is fully robust to arbitrary state preparation and measurement (SPAM) errors. They also proved a connection between the average gate fidelity and the diamond norm metric of error that is relevant to the fault-tolerant threshold. They also provided evidence that the observed decay was exponential and related to the average gate fidelity even if the error model varied across the gate operations (so-called gate-dependent errors), which is the experimentally realistic situation. In 2018, Wallman and Dugas et al. showed that, despite concerns raised in earlier work, even under very strong gate-dependent errors the standard RB protocol produces an exponential decay at a rate that precisely measures the average gate fidelity of the experimentally relevant errors. The results of Wallman in particular proved that the RB error rate is so robust to gate-dependent error models that it provides an extremely sensitive tool for detecting non-Markovian errors.
This follows because under a standard RB experiment only non-Markovian errors (including time-dependent Markovian errors) can produce a statistically significant deviation from an exponential decay. The standard RB protocol was first implemented for single-qubit gate operations in 2012 at Yale on a superconducting qubit. A variation of this standard protocol that is only defined for single-qubit operations was implemented by NIST in 2008 on a trapped ion. The first implementation of the standard RB protocol for two-qubit gates was performed in 2012 at NIST for a system of two trapped ions.
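The exponential fidelity decay described in the overview can be illustrated with a small numerical sketch. It assumes the zeroth-order decay model commonly used in the RB literature, F(m) = A p^m + B, together with the conversion r = (d - 1)(1 - p)/d from the decay parameter p to an average error rate; neither formula is stated in the text above, and the function names are hypothetical. Instead of the least-squares fit over many noisy sequence lengths that a real experiment would use, it uses a simple three-point ratio on noiseless synthetic data.

```python
def survival_probability(m, p, a, b):
    """Assumed zeroth-order RB decay model: F(m) = a * p**m + b.

    a and b absorb state-preparation and measurement (SPAM) errors;
    p is the decay parameter set by the gate errors.
    """
    return a * p ** m + b


def estimate_decay(f0, f1, f2, step):
    """Estimate p from survival values at lengths m, m + step, m + 2 * step.

    (f2 - f1) / (f1 - f0) equals p**step for the model above, so the
    SPAM-dependent constants a and b cancel out of the estimate.
    """
    return ((f2 - f1) / (f1 - f0)) ** (1.0 / step)


def average_error_rate(p, n_qubits):
    """Convert the decay parameter to an average error rate, with d = 2**n."""
    d = 2 ** n_qubits
    return (d - 1) * (1.0 - p) / d


# Synthetic, noiseless data for one qubit; both SPAM settings are made up.
true_p = 0.98
for a, b in [(0.5, 0.5), (0.42, 0.47)]:
    f0, f1, f2 = (survival_probability(m, true_p, a, b) for m in (10, 30, 50))
    p_hat = estimate_decay(f0, f1, f2, step=20)
    print(round(p_hat, 6), round(average_error_rate(p_hat, n_qubits=1), 6))
```

Under this model both SPAM settings should give the same estimate of p, which is the sense in which the decay rate is insensitive to state-preparation and measurement errors. With measured data the ratio estimator is fragile, so a fit over many sequence lengths would be the more realistic choice.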

Digital media service

A digital media service (DMS) is an online service provider that sells access to a digital library of content such as films, software, games, images, literature, etc. While no transfer of property is made, a nearly perfect duplicate of the data (song, movie, etc.) is made on a customer's computer. Content is either hosted primarily on a dedicated server owned by the service provider, or hosted primarily on the hard drives of its customers using a P2P protocol, perhaps with a dedicated server to supplement it.

== History ==
One example of the older business model is the iTunes Store, which still markets and prices data as individual retail products. There are no examples of the latter business model in operation yet, but one is currently in development by Global Gaming Factory X and expected to begin operation some time after they acquire The Pirate Bay domain on August 27, 2009. A key difference between the two models is that a service which relies on its customers to offer their bandwidth so that other customers can access customer-hosted data can operate at significantly lower cost than a company that limits data access to a per-download fee in order to offset the cost of its own hosting and bandwidth. The P2P model holds the potential for companies to offer their customers unlimited access to the largest data library in the history of the internet for a reasonably low membership rate that is proportionate to the cost of operation. While the market is virtually untouched, the P2P-supplemented model will need entrepreneurs who are able to overcome a series of challenges in order to compete with the older business model as well as with what is offered for free (and often against the wishes of copyright holders) by hundreds of P2P communities on the internet.
These challenges include, but are not limited to, offering better data quality, speed, convenience and ease of use, protocol, sense of security, indexing and search organization, site uptime, data library size, customer support, advertising, artist/copyright holder incentives and compensation, incentives and compensation for customers hosting data and providing bandwidth, and guaranteed seeding (available access to indexed data at all times) than competitors.

Digital first

Digital first is a communication theory that publishers should release content into new media channels in preference to old media. The premise behind the theory is that after the advent of the Internet, most established media organizations continued to give priority to traditional media. Over time, those organizations faced a choice to publish first either in digital media or in traditional media. A "digital first" decision occurs when a publisher chooses to distribute information online in preference to or at the expense of traditional media like print publishing. Many employers and employees find it challenging to imagine using digital first practices. Distributing content digital first introduces new practices, including a need to manage the data which tracks readership. Many paper print publishers feel intimidated by the idea of publishing content online before publishing it in paper media. Comedian John Oliver in the show Last Week Tonight criticized digital first practices as a cause of lower standards in journalism.

== Digital-First Transformation in Business and Education ==
The classical perspective of an information system is that it represents and reflects physical reality. However, it is increasingly evident that digital technologies not only represent reality but also actively shape it, as, in many instances, the digital version is created first, and the physical version follows. Gradually, digital infrastructures are integrated into people's work and life, shaping a digital environment through technologies such as 5G, sensors, and blockchain. The Digital First Framework, developed by Professor Youngjin Yoo, is a conceptual approach that helps physical companies integrate digital technologies into the core of product and service design.
The shift from traditional cars, where the physical vehicle precedes its digital representation on Google Maps, to autonomous vehicles, where the digital representation (the blue dot) is created first, emphasizes the digital-first mindset in the design and operation of systems. In today's business environment, it's critical for organizations to embrace a digital-first strategy. Companies built on digital platforms will significantly diverge from traditional, hierarchical business structures that typically focus on a single product or market. These digitally-centered enterprises will offer products and services that are tailored to individual requirements, utilizing algorithms to assess needs based on specific situations, and relying on external partners to provide these solutions. This highlights the need to transform traditional R&D practices. It's essential for R&D teams to move beyond their laboratories and immerse themselves in the environments of their users. Understanding the context of use is fundamental to creating a relevant platform. As an illustration, the concept of Digital-first, as defined by Rohm et al. (2019), involves the integration of digital projects within educational courses, exemplified by institutions like M-School. The program adopts a programmatic approach, where successive courses progressively build upon one another, adopting an all-encompassing perspective that regards all aspects of marketing as inherently digital. Students actively participate in real-world projects, including campaigns for community improvement, and are tasked with generating content for diverse platforms. Through hands-on collaboration with live clients and the utilization of tools such as Google AdWords and Facebook Advertising, students acquire practical experience in the realms of digital marketing and analytics.

== vBook ==
A vBook is an eBook that is digital first media with embedded video, images, graphs, tables, text, and other media.

Text Retrieval Conference

The Text REtrieval Conference (TREC) is an ongoing series of workshops focusing on a list of different information retrieval (IR) research areas, or tracks. It is co-sponsored by the National Institute of Standards and Technology (NIST) and the Intelligence Advanced Research Projects Activity (part of the office of the Director of National Intelligence), and began in 1992 as part of the TIPSTER Text program. Its purpose is to support and encourage research within the information retrieval community by providing the infrastructure necessary for large-scale evaluation of text retrieval methodologies and to increase the speed of lab-to-product transfer of technology. TREC's evaluation protocols have improved many search technologies. A 2010 study estimated that "without TREC, U.S. Internet users would have spent up to 3.15 billion additional hours using web search engines between 1999 and 2009." Hal Varian, the Chief Economist at Google, wrote that "The TREC data revitalized research on information retrieval. Having a standard, widely available, and carefully constructed set of data laid the groundwork for further innovation in this field." Each track has a challenge wherein NIST provides participating groups with data sets and test problems. Depending on the track, test problems might be questions, topics, or target extractable features. Uniform scoring is performed so the systems can be fairly evaluated. After evaluation of the results, a workshop provides a place for participants to gather thoughts and ideas and present current and future research work. TREC started in 1992, funded by DARPA (the US Defense Advanced Research Projects Agency) and run by NIST, to support research within the information retrieval community by providing the infrastructure necessary for large-scale evaluation of text retrieval methodologies.
== Goals ==
Encourage retrieval research based on large text collections
Increase communication among industry, academia, and government by creating an open forum for the exchange of research ideas
Speed the transfer of technology from research labs into commercial products by demonstrating substantial improvements in retrieval methodologies on real-world problems
Increase the availability of appropriate evaluation techniques for use by industry and academia, including development of new evaluation techniques more applicable to current systems

TREC is overseen by a program committee consisting of representatives from government, industry, and academia. For each TREC, NIST provides a set of documents and questions. Participants run their own retrieval systems on the data and return to NIST a list of the retrieved top-ranked documents. NIST pools the individual results, judges the retrieved documents for correctness, and evaluates the results. The TREC cycle ends with a workshop that is a forum for participants to share their experiences.

== Relevance judgments in TREC ==
TREC defines relevance as: "If you were writing a report on the subject of the topic and would use the information contained in the document in the report, then the document is relevant." Most TREC retrieval tasks use binary relevance: a document is either relevant or not relevant. Some TREC tasks use graded relevance, capturing multiple degrees of relevance. Most TREC collections are too large for complete relevance assessment; for these collections it is impossible to calculate the absolute recall for each query. To decide which documents to assess, TREC usually uses a method called pooling. In this method, the top-ranked n documents from each contributing run are aggregated, and the resulting document set is judged completely.

== Various TRECs ==
In 1992 TREC-1 was held at NIST. The first conference attracted 28 groups of researchers from academia and industry.
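The pooling method described under relevance judgments above can be sketched in a few lines. This is a minimal illustration, not TREC's actual tooling: the function names, run names and document identifiers are hypothetical, and the treatment of unjudged documents as not relevant is an assumption made for the example.

```python
def build_pool(runs, depth):
    """Return the set of documents to judge: the union of each run's top `depth`."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:depth])
    return pool


def precision_at_k(ranking, judged_relevant, k):
    """Fraction of the top k documents judged relevant (binary relevance).

    Assumption: documents outside the pool were never judged and are
    counted as not relevant here.
    """
    top_k = ranking[:k]
    return sum(1 for doc in top_k if doc in judged_relevant) / k


# Hypothetical runs for one topic, best-ranked document first.
runs = {
    "run_a": ["d1", "d2", "d3", "d4"],
    "run_b": ["d3", "d5", "d1", "d6"],
}
pool = build_pool(runs, depth=2)   # documents sent to the assessors
judged_relevant = {"d1", "d3"}     # made-up assessor decisions within the pool
print(sorted(pool))
print(precision_at_k(runs["run_a"], judged_relevant, k=4))
```

Because only pooled documents are judged, any relevant document that no run ranked within the pool depth goes uncounted, which is why absolute recall cannot be computed on such collections.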
TREC-1 demonstrated a wide range of different approaches to the retrieval of text from large document collections. Finally, TREC-1 showed that automatic construction of queries from natural language query statements seems to work. Techniques based on natural language processing were no better and no worse than those based on vector or probabilistic approaches.
TREC-2 took place in August 1993, with 31 groups of researchers participating. Two types of retrieval were examined: retrieval using an 'ad hoc' query and retrieval using a 'routing' query.
In TREC-3, a small group of experiments worked with a Spanish-language collection, and others dealt with interactive query formulation in multiple databases.
In TREC-4, the topics were made even shorter to investigate the problems with very short user statements.
TREC-5 included both short and long versions of the topics, with the goal of carrying out a deeper investigation into which types of techniques work well on various lengths of topics.
In TREC-6, three new tracks were introduced: speech, cross-language, and high-precision information retrieval. The goal of cross-language information retrieval is to facilitate research on systems that are able to retrieve relevant documents regardless of the language of the source document.
TREC-7 contained seven tracks, two of which were new: the query track and the very large corpus track. The goal of the query track was to create a large query collection.
TREC-8 contained seven tracks, two of which were new: the question answering (QA) track and the web track. The objective of the QA track is to explore the possibilities of providing answers to specific natural language queries.
TREC-9 included seven tracks.
In TREC-10, a video track was introduced, designed to promote research in content-based retrieval from digital video.
In TREC-11, a novelty track was introduced. The goal of the novelty track is to investigate systems' abilities to locate relevant and new information within the ranked set of documents returned by a traditional document retrieval system.
TREC-12, held in 2003, added three new tracks: the genome track, the robust retrieval track, and HARD (Highly Accurate Retrieval from Documents).

== Tracks ==
=== Current tracks ===
New tracks are added as new research needs are identified; this list is current for TREC 2018.
CENTRE Track – Goal: run in parallel with CLEF 2018, NTCIR-14, and TREC 2018 to develop and tune an IR reproducibility evaluation protocol (new track for 2018).
Common Core Track – Goal: an ad hoc search task over news documents.
Complex Answer Retrieval (CAR) – Goal: to develop systems capable of answering complex information needs by collating information from an entire corpus.
Incident Streams Track – Goal: to research technologies to automatically process social media streams during emergency situations (new track for TREC 2018).
The News Track – Goal: partnership with The Washington Post to develop test collections in a news environment (new for 2018).
Precision Medicine Track – Goal: a specialization of the Clinical Decision Support track to focus on linking oncology patient data to clinical trials.
Real-Time Summarization Track (RTS) – Goal: to explore techniques for real-time update summaries from social media streams.

=== Past tracks ===
Chemical Track – Goal: to develop and evaluate technology for large-scale search in chemistry-related documents, including academic papers and patents, to better meet the needs of professional searchers, and specifically patent searchers and chemists.
Clinical Decision Support Track – Goal: to investigate techniques for linking medical cases to information relevant for patient care.
Contextual Suggestion Track – Goal: to investigate search techniques for complex information needs that are highly dependent on context and user interests.
Crowdsourcing Track – Goal: to provide a collaborative venue for exploring crowdsourcing methods both for evaluating search and for performing search tasks.
Genomics Track – Goal: to study the retrieval of genomic data, not just gene sequences but also supporting documentation such as research papers, lab reports, etc. Last ran in TREC 2007.
Dynamic Domain Track – Goal: to investigate domain-specific search algorithms that adapt to the dynamic information needs of professional users as they explore complex domains.
Enterprise Track – Goal: to study search over the data of an organization to complete some task. Last ran in TREC 2008.
Entity Track – Goal: to perform entity-related search on Web data. These search tasks (such as finding entities and properties of entities) address common information needs that are not well modeled as ad hoc document search.
Cross-Language Track – Goal: to investigate the ability of retrieval systems to find topically relevant documents regardless of source language. After 1999, this track spun off into CLEF.
FedWeb Track – Goal: to select the best resources to forward a query to, and merge the results so that the most relevant are at the top.
Federated Web Search Track – Goal: to investigate techniques for the selection and combination of search results from a large number of real on-line web search services.
Filtering Track – Goal: to binarily decide retrieval of new

Facebook Platform

The Facebook Platform is the set of services, tools, and products provided by the social networking service Facebook for third-party developers to create their own applications and services that access data in Facebook. The current Facebook Platform was launched in 2010. The platform offers a set of programming interfaces and tools which enable developers to integrate with the open "social graph" of personal relations and other things like songs, places, and Facebook pages. Applications on facebook.com, external websites, and devices are all allowed to access the graph.

== History ==
Facebook launched the Facebook Platform on May 24, 2007, providing a framework for software developers to create applications that interact with core Facebook features. A markup language called Facebook Markup Language was introduced simultaneously; it is used to customize the "look and feel" of applications that developers create. Prior to the Facebook Platform, Facebook had built many applications itself within the Facebook website, including Gifts, allowing users to send virtual gifts to each other; Marketplace, allowing users to post free classified ads; Facebook events, giving users a method of informing their friends about upcoming events; Video, letting users share homemade videos with one another; and social network games, in which users can use their connections to friends to help them advance in the games they are playing. The Facebook Platform made it possible for outside partners to build similar applications. Many of the popular early social network games would combine capabilities. For instance, one of the early games to reach the top application spot, (Lil) Green Patch, combined virtual Gifts with Event notifications to friends and contributions to charities through Causes. Third-party companies provide application metrics, and several blogs arose in response to the clamor for Facebook applications.
On July 4, 2007, Altura Ventures announced the "Altura 1 Facebook Investment Fund," becoming the world's first Facebook-only venture capital firm. On August 29, 2007, Facebook changed the way in which the popularity of applications is measured, to give attention to the more engaging applications, following criticism that ranking applications only by the number of people who had installed them was giving an advantage to highly viral, yet useless applications. Tech blog Valleywag has criticized Facebook Applications, labeling them a "cornucopia of uselessness." Others have called for limiting third-party applications so the Facebook user experience is not degraded. Applications that have been created on the Platform include chess, which allows users to play games with their friends. In such games, a user's moves are saved on the website, allowing the next move to be made at any time rather than immediately after the previous move. By November 3, 2007, seven thousand applications had been developed on the Facebook Platform, with another hundred created every day. By the second annual f8 developers conference on July 23, 2008, the number of applications had grown to 33,000, and the number of registered developers had exceeded 400,000. Within a few months of launching the Facebook Platform, issues arose regarding "application spam", which involves Facebook applications "spamming" users with requests that they be installed. Facebook integration was announced for the Xbox 360 and Nintendo DSi on June 1, 2009, at E3. On November 18, 2009, Sony announced an integration with Facebook to deliver the first phase of a variety of new features to further connect and enhance the online social experiences of PlayStation 3. On February 2, 2010, Facebook announced the release of HipHop for PHP as an open-source project. Mark Zuckerberg said that his team from Facebook is developing a Facebook search engine. “Facebook is pretty well placed to respond to people’s questions.
At some point, we will. We have a team that is working on it”, said Mark Zuckerberg. For him, the traditional search engines return too many results that do not necessarily respond to questions. “The search engines really need to evolve a set of answers: 'I have a specific question, answer this question for me.'” On June 10, 2014, Facebook announced Haxl, a Haskell library that simplified access to remote data, such as databases or web-based services.

=== Partnerships with device manufacturers ===
Starting in 2007, Facebook formed data sharing partnerships with at least 60 handset manufacturers, including Apple, Amazon, BlackBerry, Microsoft and Samsung. Those manufacturers were provided with Facebook user data without the users' consent. Most of the partnerships remained in place as of 2018, when the partnerships were first publicly reported.

== High-level Platform components ==
=== Graph API ===
The Graph API is the core of Facebook Platform, enabling developers to read from and write data into Facebook. The Graph API presents a simple, consistent view of the Facebook social graph, uniformly representing objects in the graph (e.g., people, photos, events, and pages) and the connections between them (e.g., friend relationships, shared content, and photo tags). On April 30, 2015, Facebook shut down the friends' data API prior to the v2.0 release.

=== Authentication ===
Facebook authentication enables developers’ applications to interact with the Graph API on behalf of Facebook users, and it provides a single sign-on mechanism across web, mobile, and desktop apps.

==== Facebook Connect ====
Facebook Connect, also called Log in with Facebook, is, like OpenID, a set of authentication APIs from Facebook that developers can use to help their users connect and share with such users' Facebook friends (on and off Facebook) and increase engagement for their website or application.
When so used, Facebook members can log on to third-party websites, applications, mobile devices and gaming systems with their Facebook identity and, while logged in, can connect with friends via these media and post information and updates to their Facebook profile. Originally unveiled during Facebook's developer conference, F8, in July 2008, Log in with Facebook became generally available in December 2008. According to an article from The New York Times, "Some say the services are representative of surprising new thinking in Silicon Valley. Instead of trying to hoard information about their users, the Internet companies (including Facebook, Google, MySpace and Twitter) all share at least some of that data so people do not have to enter the same identifying information again and again on different sites." Log in with Facebook cannot be used by users in locations that cannot access Facebook, even if the third-party site is otherwise accessible from that location. According to Facebook, users who logged into The Huffington Post with Facebook spent more time on the site than the average user.

=== Social plugins ===
Social plugins – including the Like Button, Recommendations, and Activity Feed – enable developers to provide social experiences to their users with just a few lines of HTML. All social plugins are extensions of Facebook and are designed so that no user data is shared with the sites on which they appear. On the other hand, the social plugins let Facebook track its users’ browsing habits through any sites that feature the plugins.

=== Open Graph protocol ===
The Open Graph protocol enables developers to integrate their pages into Facebook's global mapping/tracking tool Social Graph. These pages gain the functionality of other graph objects including profile links and stream updates for connected users.
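As a rough illustration of how a page declares itself to the Open Graph protocol, the sketch below renders the protocol's four basic properties (og:title, og:type, og:url and og:image) as HTML meta tags. The helper name render_open_graph_tags and the example.com values are hypothetical; only the meta tag form with property and content attributes comes from the protocol itself.

```python
from html import escape


def render_open_graph_tags(properties):
    """Render a dict of Open Graph properties as HTML <meta> tags.

    Keys are given without the "og:" prefix; values are HTML-escaped.
    """
    tags = []
    for name, value in properties.items():
        tags.append(
            '<meta property="og:{}" content="{}" />'.format(
                escape(name, quote=True), escape(value, quote=True)
            )
        )
    return "\n".join(tags)


# Hypothetical page; the example.com URLs are placeholders.
page = {
    "title": "Example Page",
    "type": "website",
    "url": "https://example.com/page",
    "image": "https://example.com/preview.png",
}
print(render_open_graph_tags(page))
```

The generated tags would be placed in the head element of the page. Escaping the values keeps titles that contain quotes or ampersands from breaking the markup.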
Open Graph tags are added to a page as meta elements in its HTML.

=== iframes ===
Facebook uses iframes to allow third-party developers to create applications that are hosted separately from Facebook, but operate within a Facebook session and are accessed through a user's profile. Since iframes essentially nest independent websites within a Facebook session, their content is distinct from Facebook formatting. Facebook originally used Facebook Markup Language (FBML) to allow Facebook Application developers to customize the "look and feel" of their applications, to a limited extent. FBML is a specification of how to encode content so that Facebook's servers can read and publish it, which is needed in the Facebook-specific feed so that Facebook's system can properly parse content and publish it as specified. FBML set by any application is cached by Facebook until a subsequent API call replaces it. Facebook also offers a specialized Facebook JavaScript (FBJS) library. Facebook stopped accepting new FBML applications on March 18, 2011, but continued to support existing FBML tabs and applications. As of January 1, 2012, FBML was no longer supported, and FBML no longer functioned as of June 1, 2012.

=== Microformats ===
In February 2011, Facebook began to use the hCalendar microformat to mark up events, and the hCard for the events' venues,