Google Research

Google Research (also known as Research at Google) is the research division of Google, a subsidiary of Alphabet Inc. According to its official website, Google Research publishes findings, releases open-source software, and applies research results within Google products and services as well as within the wider scientific community. == Notable contributions == The 2017 landmark paper Attention Is All You Need, which introduced the Transformer architecture that has subsequently been used to build modern large language models. Advances in neural machine translation powering Google Translate. Time series forecasting. Development of scalable learning systems and infrastructure for large-model training. Flood forecasting. Research into computational discovery via Google Accelerated Science including demonstrating the first below-threshold quantum calculations.

History of natural language processing

The history of natural language processing describes the advances in natural language processing. There is some overlap with the history of machine translation, the history of speech recognition, and the history of artificial intelligence. == Early history == The history of machine translation dates back to the seventeenth century, when philosophers such as Leibniz and Descartes put forward proposals for codes which would relate words between languages. All of these proposals remained theoretical, and none resulted in the development of an actual machine. The first patents for "translating machines" were applied for in the mid-1930s. One proposal, by Georges Artsrouni, was simply an automatic bilingual dictionary using paper tape. The other proposal, by Peter Troyanskii, a Russian, was more detailed. Troyanskii’s proposal included both the bilingual dictionary and a method for dealing with grammatical roles between languages, based on Esperanto. == Logical period == In 1950, Alan Turing published his famous article "Computing Machinery and Intelligence", which proposed what is now called the Turing test as a criterion of intelligence. This criterion depends on the ability of a computer program to impersonate a human in a real-time written conversation with a human judge, sufficiently well that the judge is unable to distinguish reliably, on the basis of the conversational content alone, between the program and a real human. In 1957, Noam Chomsky’s Syntactic Structures revolutionized linguistics with 'universal grammar', a rule-based system of syntactic structures. The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English. The authors claimed that within three or five years, machine translation would be a solved problem. 
However, real progress was much slower, and after the ALPAC report in 1966, which found that ten years of research had failed to fulfill expectations, funding for machine translation was dramatically reduced. Little further research in machine translation was conducted until the late 1980s, when the first statistical machine translation systems were developed. One notably successful NLP system developed in the 1960s was SHRDLU, a natural language system working in restricted "blocks worlds" with restricted vocabularies. In 1969, Roger Schank introduced the conceptual dependency theory for natural language understanding. This model, partially influenced by the work of Sydney Lamb, was extensively used by Schank's students at Yale University, such as Robert Wilensky, Wendy Lehnert, and Janet Kolodner. In 1970, William A. Woods introduced the augmented transition network (ATN) to represent natural language input. Instead of phrase structure rules, ATNs used an equivalent set of finite-state automata that were called recursively. ATNs and their more general format called "generalized ATNs" continued to be used for a number of years. During the 1970s, many programmers began to write 'conceptual ontologies', which structured real-world information into computer-understandable data. Examples are MARGIE (Schank, 1975), SAM (Cullingford, 1978), PAM (Wilensky, 1978), TaleSpin (Meehan, 1976), QUALM (Lehnert, 1977), Politics (Carbonell, 1979), and Plot Units (Lehnert, 1981). During this time, many chatterbots were written, including PARRY, Racter, and Jabberwacky. == Statistical period == Up to the 1980s, most NLP systems were based on complex sets of hand-written rules. Starting in the late 1980s, however, there was a revolution in NLP with the introduction of machine learning algorithms for language processing. 
This was due both to the steady increase in computational power resulting from Moore's law and to the gradual lessening of the dominance of Chomskyan theories of linguistics (e.g. transformational grammar), whose theoretical underpinnings discouraged the sort of corpus linguistics that underlies the machine-learning approach to language processing. Some of the earliest-used machine learning algorithms, such as decision trees, produced systems of hard if-then rules similar to existing hand-written rules. Increasingly, however, research has focused on statistical models, which make soft, probabilistic decisions based on attaching real-valued weights to the features making up the input data. The cache language models upon which many speech recognition systems now rely are examples of such statistical models. Such models are generally more robust when given unfamiliar input, especially input that contains errors (as is very common for real-world data), and produce more reliable results when integrated into a larger system comprising multiple subtasks. === Datasets === The emergence of statistical approaches was aided by both the increase in computing power and the availability of large datasets. At that time, large multilingual corpora were starting to emerge. Notably, some were produced by the Parliament of Canada and the European Union as a result of laws calling for the translation of all governmental proceedings into all official languages of the corresponding systems of government. Many of the notable early successes occurred in the field of machine translation. In 1993, the IBM alignment models were used for statistical machine translation. Compared to previous machine translation systems, which were symbolic systems manually coded by computational linguists, these systems were statistical, which allowed them to automatically learn from large textual corpora. 
However, these systems do not work well when only small corpora are available, so data-efficient methods continue to be an area of research and development. In 2001, a one-billion-word text corpus scraped from the Internet, referred to as "very very large" at the time, was used for word disambiguation. To take advantage of large, unlabelled datasets, algorithms were developed for unsupervised and self-supervised learning. Generally, this task is much more difficult than supervised learning, and typically produces less accurate results for a given amount of input data. However, there is an enormous amount of non-annotated data available (including, among other things, the entire content of the World Wide Web), which can often make up for the inferior results. == Neural period == Neural language models were developed in the 1990s. In 1990, the Elman network, using a recurrent neural network, encoded each word in a training set as a vector, called a word embedding, and the whole vocabulary as a vector database, allowing it to perform such tasks as sequence prediction that are beyond the power of a simple multilayer perceptron. A shortcoming of the static embeddings was that they did not differentiate between multiple meanings of homonyms. Yoshua Bengio developed the first neural probabilistic language model in 2000. Novel algorithms, the availability of larger datasets, and greater processing power made it possible to train ever larger language models. The attention mechanism was introduced by Bahdanau et al. in 2014. This work laid the foundations for the famous "Attention Is All You Need" paper that introduced the Transformer architecture in 2017. The concept of the large language model (LLM) emerged in the late 2010s. An LLM is a language model trained with self-supervised learning on a vast amount of text. The earliest public LLMs had hundreds of millions of parameters, but this number quickly rose to billions and even trillions. 
In recent years, advancements in deep learning and large language models have significantly enhanced the capabilities of natural language processing, leading to widespread applications in areas such as healthcare, customer service, and content generation. == Software ==

Pocket (service)

Pocket, formerly known as Read It Later, was a social bookmarking service for storing, sharing and discovering web bookmarks, first released in 2007. Mozilla, the developer of Pocket, announced in May 2025 that it was discontinuing the service and would shut it down in July of that year. == History == Pocket was introduced in August 2007 as a Mozilla Firefox browser extension named Read It Later by Nathan (Nate) Weiner. Once his product was used by millions of people, he moved his office to Silicon Valley and four other people joined the Read It Later team. Weiner's intention was for the application to be like a TiVo directory for web content and to give users access to that content on any device. Read It Later obtained venture capital investments of US$2.5 million in 2011 and $5.0 million in 2012. The 2011 funding came from Foundation Capital, Baseline Ventures, Google Ventures, Founder Collective and unnamed angel investors. The company rejected an acquisition offer by Evernote after raising concerns that Evernote intended to shut down the Read It Later service and amalgamate its functionality into Evernote's main service. Initially, the Read It Later app was available in a free version and a paid version that included additional features. After the rebranding to Pocket, all paid features were made available in a free and advertisement-free app. In May 2014, a paid subscription service called Pocket Premium was introduced, adding server-side storage of articles and more powerful search tools. In June 2015, Pocket was included in Firefox, via a toolbar button and a link to a user's Pocket list in the bookmarks menu. The integration was controversial: users raised concerns about the direct integration of a proprietary service into an open source application, and about the fact that, unlike other third-party extensions, it could not be completely disabled without editing advanced settings. 
A Mozilla spokesperson stated that the feature was meant to leverage the service's popularity among Firefox users and clarified that all code related to the integration was open source. The spokesperson added that "[Mozilla had] gotten lots of positive feedback about the integration from users". On February 27, 2017, Pocket announced that it had been acquired by Mozilla Corporation, the commercial arm of Firefox's non-profit development group. Mozilla staff stated that Pocket would continue to operate as an independent subsidiary but that it would be leveraged as part of an ongoing "Context Graph" project. There were plans to open-source the server-side code of Pocket, though only parts of the project had been open-sourced as of 2024. On May 22, 2025, Mozilla announced that it would shut down Pocket on July 8, 2025. Exports of user data would be available until October 8, 2025, when accounts would be deleted. The email newsletter Pocket Hits was rebranded as Ten Tabs on June 12 as part of the closure, with it being changed to release only on weekdays. == Functions == The application allows the user to save an article or web page to remote servers for later reading. The article is sent to the user's Pocket list (synced to all of their devices) for offline reading. Pocket makes the article more readable by removing clutter and enabling the user to add tags and adjust text settings. == User base == The application had 17 million users and 1 billion saves, as of September 2015. Pocket was listed among Time magazine's 50 Best Android Applications for 2013. == Reception == Kent German of CNET said that "Read It Later is oh so incredibly useful for saving all the articles and news stories I find while commuting or waiting in line." Erez Zukerman of PC World said that supporting the developer is enough reason to buy what he deemed a "handy app". 
Bill Barol of Forbes said that although Read It Later works less well than Instapaper, "it makes my beloved Instapaper look and feel a little stodgy." In 2015, Pocket was awarded a Material Design Award for Adaptive Layout by Google for their Android application.

List of publications in data science

This is a list of publications in data science, generally organized by order of use in a data analysis workflow. See the list of publications in statistics for more research-based and fundamental publications; this list is more applied, business-oriented, and cross-disciplinary. General article inclusion criteria are: papers from notable practitioners or notable professors, either with a Wikipedia page or reference to their notability; common knowledge all data professionals should know, with references validating this claim; highly cited applied statistics and machine learning publications; and discussion-facilitating papers on the field of data science as a whole (for example, the Attention Is All You Need paper is arguably a landmark paper that can be added here, but it is specific to generative artificial intelligence, not for all practitioners of data). Some reasons why a particular publication might be regarded as important: Topic creator – A publication that created a new topic; Breakthrough – A publication that changed scientific knowledge significantly; Influence – A publication which has significantly influenced the world or has had a massive impact on the teaching of data science. When possible, a reference is used to validate the inclusion of the publication in this list. == History == Statistical Modeling: The Two Cultures (with comments and a rejoinder by the author) Author: Leo Breiman Publication data: Online version: https://projecteuclid.org/journals/statistical-science/volume-16/issue-3/Statistical-Modeling--The-Two-Cultures-with-comments-and-a/10.1214/ss/1009213726.pdf Description: Describes two cultures of statistics, one using a parsimonious and generative stochastic model, while the other is an algorithmic model with no known mechanism for how the data is generated. Breiman argues that while statistics has traditionally favored using the stochastic model, there is value in expanding the methods that statisticians can use to study phenomena. 
Importance: Influence on the philosophies of statisticians right before the increased use of machine learning and deep learning methods. In a 20-year retrospective on this article, "Breiman's words are perhaps more relevant than ever". Notable statisticians at the time wrote opinion pieces about the publication. Although overall critical of the publication, David Cox writes that the publication "contains enough truth and exposes enough weaknesses to be thought-provoking." Bradley Efron commented that this publication is a "stimulating paper". Emanuel Parzen also comments about this publication that "Breiman alerts us to systematic blunders (leading to wrong conclusions) that have been committed applying current statistical practice of data modeling". Data Scientist: The Sexiest Job of the 21st Century Author: Thomas H. Davenport and DJ Patil Publication data: Online version: hbr.org/2022/07/is-data-scientist-still-the-sexiest-job-of-the-21st-century Description: Describes the new role at companies that is coined "Data scientist", what they do, how an organization might recruit one to their organization, and how to work with one effectively. Importance: This publication has been an influence on the data community as mentioned near the time it was published in 2012 by institutions like IEEE Spectrum, but also mentioned nearly a decade later asking the same question the title poses. In a retrospective response to their own publication 10 years earlier, authors Davenport and Patil have reflected that the role of a data scientist has "become better institutionalized, the scope of the job has been redefined, the technology it relies on has made huge strides, and the importance of non-technical expertise, such as ethics and change management, has grown". 
50 Years of Data Science Author: David Donoho Publication data: Online version: https://www.tandfonline.com/doi/full/10.1080/10618600.2017.1384734 Description: Retrospective discussion paper on the history and origins of data science, with a number of commentaries from notable statisticians. Importance: This has been described as "the first in the field to present such a comprehensive and in-depth survey and overview", and helps to define a field that has many definitions. The Composable Data Management System Manifesto Author: Pedro Pedreira, Orri Erling, Konstantinos Karanasos, Scott Schneider, Wes McKinney, Satya R Valluri, Mohamed Zait, Jacques Nadeau Publication data: Online version: https://www.vldb.org/pvldb/vol16/p2679-pedreira.pdf Description: A vision paper advocating for a paradigm shift in how data management systems are designed, using standard, composable, interoperable tools rather than siloed software tools. Importance: A paradigm-shifting view on how future data science software tools should be designed for more efficient workflows, the principles of which "will be especially crucial for addressing fragmentation, improving interoperability, and promoting user-centricity as data ecosystems grow increasingly complex". == Data collection and organization == Tidy Data Author: Hadley Wickham Publication data: Online version: https://www.jstatsoft.org/article/view/v059i10/ https://vita.had.co.nz/papers/tidy-data.pdf Description: Describes a framework for data cleaning that is summarized in the quote, "each variable is a column, each observation is a row, and each type of observational unit is a table". This provides a standard data structure around which data analysis tools can be consistently built. Importance: Cited over 1,500 times, this effort for tidy data has been described by David Donoho as having "more impact on today's practice of data analysis than many highly regarded theoretical statistics articles". 
In the context of data visualization, this publication is said to support "efficient exploration and prototyping because variables can be assigned different roles in the plot without modifying anything about the original dataset". Data Organization in Spreadsheets Author: Karl W. Broman and Kara H. Woo Publication data: Online version: https://www.tandfonline.com/doi/full/10.1080/00031305.2017.1375989 Description: This article offers practical recommendations for organizing data in spreadsheets, like Microsoft Excel and Google Sheets, to reduce errors and lower the barrier for later analyses due to limitations in spreadsheets or quirks in the software. Importance: Influences the teaching of both data and non-data practitioners to create more analysis-friendly spreadsheets, and has been described as outlining "spreadsheet best practices". == Data visualizations == Quantitative Graphics in Statistics: A Brief History Author: James R. Beniger and Dorothy L. Robyn Publication data: Online version: https://www.jstor.org/stable/2683467 Description: Outlines the history and evolution of quantitative graphics in statistics, going through spatial organization (17th and 18th centuries), discrete comparison (18th and 19th centuries), continuous distribution (19th century), and multivariate distribution and correlation (late 19th and 20th centuries). Importance: Helps data practitioners in training put into perspective how recent the graphics in common use are. A later publication, "Graphical Methods in Statistics" by Stephen Fienberg in 1979, states that it "owes much to the work of Beniger and Robyn". == Practice == Data Science for Business Author: Foster Provost and Tom Fawcett Publication data: Online version: N/A Description: Broadly outlines principles of data science and data-analytic thinking for businesses. 
Importance: Cited over 3,000 times, it is "highly recommended for students" and is also recommended due to its "relevance to senior management leaders who want to build and lead a team of data scientists and implement data science in solving complex business problems". == Tooling == Hidden Technical Debt in Machine Learning Systems Author: D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-François Crespo, Dan Dennison Publication data: Online version: https://proceedings.neurips.cc/paper_files/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf Description: This paper argues that it is "dangerous to think of [complex machine learning] quick wins as coming for free" and overviews risk factors to account for when implementing a machine learning system. Importance: All authors worked for Google, the article has been cited over 2,000 times, and it has helped practitioners who are thinking about quickly implementing a machine learning tool to consider the long-term maintenance of the tool. A few useful things to know about machine learning Author: Pedro Domingos Publication data: Online version: https://dl.acm.org/doi/10.1145/2347736.2347755 https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf Description: The purpose of this paper is to distill inaccessible "folk knowledge" to effectively implement machine learning projects because "machin

Dimensions CM

Dimensions CM is a software change and configuration management product developed by OpenText Corporation. It includes revision control, change, build and release management capabilities. Since 2014 (v14.1), Dimensions CM has included the PulseUno module, which provides code review and continuous integration capabilities. Starting with version 14.5.2 (2020), it can also serve as a binary repository manager. == History == Previous product names: PCMS Dimensions (SQL Software), PVCS Dimensions (Merant, Intersolv)

Automated negotiation

Automated negotiation is a form of interaction in systems that are composed of multiple autonomous agents, in which the aim is to reach agreements through an iterative process of making offers. Automated negotiation can be employed for many tasks human negotiators regularly engage in, such as bargaining and joint decision making. The main topics in automated negotiation revolve around the design of protocols and negotiating strategies. == History == Through digitization, the beginning of the 21st century has seen a growing interest in the automation of negotiation and e-negotiation systems, for example in the setting of e-commerce. This interest is fueled by the promise of automated agents being able to negotiate on behalf of human negotiators, and to find better outcomes than human negotiators. == Examples == Examples of automated negotiation include: Online dispute resolution, in which disagreements between parties are settled. Sponsored search auction, where bids are placed on advertisement keywords. Content negotiation, in which user agents negotiate over HTTP about how to best represent a web resource. Negotiation support systems, in which negotiation decision-making activities are supported by an information system.
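The iterative exchange of offers described above can be illustrated with a minimal sketch of an alternating-offers protocol. This is an illustration under stated assumptions rather than a standard implementation: the `Agent` class, the linear time-based concession strategy, and all prices are hypothetical and chosen only to show how a protocol (who proposes when) and a strategy (what to propose and when to accept) fit together.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    """A hypothetical negotiator with a linear time-based concession strategy."""
    name: str
    opening: float      # first price this agent proposes
    reservation: float  # worst price this agent will still accept

    def offer(self, round_no: int, max_rounds: int) -> float:
        # Concede linearly from the opening price toward the reservation price.
        t = round_no / max(max_rounds - 1, 1)
        return self.opening + t * (self.reservation - self.opening)

    def accepts(self, price: float) -> bool:
        # A buyer (opening <= reservation) accepts prices at or below its limit;
        # a seller (opening > reservation) accepts prices at or above its limit.
        if self.opening <= self.reservation:
            return price <= self.reservation
        return price >= self.reservation


def negotiate(buyer: Agent, seller: Agent, max_rounds: int = 10):
    """Alternate offers until one side accepts or the rounds run out."""
    for round_no in range(max_rounds):
        # The buyer proposes in even rounds, the seller in odd rounds.
        proposer, responder = (buyer, seller) if round_no % 2 == 0 else (seller, buyer)
        price = proposer.offer(round_no, max_rounds)
        if responder.accepts(price):
            return round_no, proposer.name, price
    return None  # no agreement reached within the deadline


buyer = Agent("buyer", opening=50.0, reservation=100.0)
seller = Agent("seller", opening=150.0, reservation=80.0)
result = negotiate(buyer, seller)
print(result)
```

In this sketch the protocol is fixed (strict alternation with a deadline) while the strategy lives entirely in `offer` and `accepts`, so a different concession curve could be substituted without touching `negotiate`.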

Cygwin

Cygwin (pronounced SIG-win) is a free and open-source Unix-like environment and command-line interface (CLI) for Microsoft Windows. The project also provides a software repository containing open-source packages. Cygwin allows source code for Unix-like operating systems to be compiled and run on Windows. Cygwin provides native integration of Windows-based applications. The terminal emulator mintty is the default command-line interface provided to interact with the environment. The Cygwin installation directory layout mimics the root file system of Unix-like systems, with directories such as /bin, /home, /etc, /usr, and /var. Cygwin is released under the GNU Lesser General Public License version 3. It was originally developed by Cygnus Solutions, which was later acquired by Red Hat (now part of IBM), to port the GNU toolchain to Win32, including the GNU Compiler Suite. Rather than rewrite the tools to use the Win32 runtime environment, Cygwin implemented a POSIX-compatible environment in the form of a DLL. The brand motto is "Get that Linux feeling – on Windows", although Cygwin does not contain Linux. == History == Cygwin began in 1995 as a project of Steve Chamberlain, a Cygnus engineer who observed that Windows NT and 95 used COFF as their object file format, and that GNU already included support for x86 and COFF, and the C library newlib. He thought that it would be possible to retarget GCC and produce a cross compiler generating executables that could run on Windows. A prototype was later developed. Chamberlain bootstrapped the compiler on a Windows system, to emulate Unix to let the GNU configure shell script run. Initially, Cygwin was called Cygwin32. When Microsoft registered the trademark Win32, the "32" was dropped to simply become Cygwin. In 1999, Cygnus offered Cygwin 1.0 as a commercial product. Subsequent versions have not been released, instead relying on continued open source releases. Geoffrey Noer was the project lead from 1996 to 1999. 
Christopher Faylor was lead from 1999 to 2004; he left Red Hat and became co-lead with Corinna Vinschen. Corinna Vinschen has been the project lead from mid-2014 to date (as of September, 2024). From June 23, 2016, the Cygwin library version 2.5.2 was licensed under the GNU Lesser General Public License (LGPL) version 3. == Description == Cygwin is provided in two versions: the full 64-bit version and a stripped-down 32-bit version, whose final version was released in 2022. Cygwin consists of a library that implements the POSIX system call API in terms of Windows system calls to enable the running of a large number of application programs equivalent to those on Unix systems, and a GNU development toolchain (including GCC and GDB). Programmers have ported the X Window System, K Desktop Environment 3, GNOME, Apache, and TeX. Cygwin permits installing inetd, syslogd, sshd, Apache, and other daemons as standard Windows services. Cygwin programs have full access to the Windows API and other Windows libraries. Cygwin programs are installed by running Cygwin's "setup" program, which downloads them from repositories on the Internet. The Cygwin API library is licensed under the GNU Lesser General Public License version 3 (or later), with an exception to allow linking to any free and open-source software whose license conforms to the Open Source Definition. Cygwin consists of two parts: A dynamic-link library in the form of a C standard library that acts as a compatibility layer for the POSIX API and A collection of software tools and applications that provide a Unix-like look and feel. Cygwin supports POSIX symbolic links, representing them as plain-text files with the system attribute set. Cygwin 1.5 represented them as Windows Explorer shortcuts, but this was changed for reasons of performance and POSIX correctness. Cygwin also recognises NTFS junction points and symbolic links and treats them as POSIX symbolic links, but it does not create them. 
The POSIX API for handling access control lists (ACLs) is supported. === Technical details === A Cygwin-specific version of the Unix mount command allows mounting Windows paths as "filesystems" in the Unix file space. Initial mount points can be configured in /etc/fstab, which has a format very similar to Unix systems, except that Windows paths appear in place of devices. Filesystems can be mounted in binary mode (by default), or in text mode, which enables automatic conversion between LF and CRLF endings (which only affects programs that open files without explicitly specifying text or binary mode). Cygwin 1.7 introduced comprehensive support for POSIX locales, and the UTF-8 Unicode encoding became the default. The fork system call for duplicating a process is fully implemented, but the copy-on-write optimization strategy could not be used. Cygwin's default user interface is the bash shell running in the mintty terminal emulator. The DLL also implements pseudo terminal (pty) devices, and Cygwin ships with a number of terminal emulators that are based on them, including rxvt/urxvt and xterm. The version of GCC that comes with Cygwin has various extensions for creating Windows DLLs, such as specifying whether a program is a windowing or console-mode program. Support for compiling programs that do not require the POSIX compatibility layer provided by the Cygwin DLL used to be included in the default GCC, but as of 2014, it is provided by cross-compilers contributed by the MinGW-w64 project. == Software packages == Cygwin's base package selection is approximately 100MB, containing the bash (interactive user) and dash (installation) shells and the core file and text manipulation utilities. Additional packages are available as optional installs from within the Cygwin "setup" program and package manager ("setup-x86_64.exe" – 64 bit). The Cygwin Ports project provided additional packages that were not available in the Cygwin distribution itself. 
Examples included GNOME, K Desktop Environment 3, MySQL database, and the PHP scripting language. Most ports have been adopted by volunteer maintainers as Cygwin packages, and Cygwin Ports are no longer maintained. Cygwin ships with GTK+ and Qt. The Cygwin/X project allows graphical Unix programs to display their user interfaces on the Windows desktop for both local and remote programs.
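The text-mode mounts described under Technical details convert between LF and CRLF line endings. The following is a minimal sketch of that kind of conversion, not Cygwin's actual implementation: the function names are hypothetical, and the choice to normalise existing CRLF pairs before writing is an assumption made here so that endings are never doubled.

```python
def lf_to_crlf(data: bytes) -> bytes:
    """Convert LF line endings to CRLF, as a text-mode write might."""
    # Normalise first so existing CRLF pairs are not turned into CR CR LF.
    return data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")


def crlf_to_lf(data: bytes) -> bytes:
    """Convert CRLF line endings to LF, as a text-mode read might."""
    return data.replace(b"\r\n", b"\n")


unix_text = b"first line\nsecond line\n"
windows_text = lf_to_crlf(unix_text)
round_trip = crlf_to_lf(windows_text)
```

A program that opens a file in explicit binary mode would bypass a conversion like this, which matches the note above that text mode only affects programs that do not specify a mode.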