AI Avatar Kiosk

AI Avatar Kiosk — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • ImageNet

    ImageNet

    The ImageNet project is a large visual database designed for use in visual object recognition software research. More than 14 million images have been hand-annotated by the project to indicate what objects are pictured and in at least one million of the images, bounding boxes are also provided. ImageNet contains more than 20,000 categories, with a typical category, such as "balloon" or "strawberry", consisting of several hundred images. The database of annotations of third-party image URLs is freely available directly from ImageNet, though the actual images are not owned by ImageNet. Since 2010, the ImageNet project runs an annual software contest, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), where software programs compete to correctly classify and detect objects and scenes. The challenge uses a "trimmed" list of one thousand non-overlapping classes. == History == AI researcher Fei-Fei Li began working on the idea for ImageNet in 2006. At a time when most AI research focused on models and algorithms, Li wanted to expand and improve the data available to train AI algorithms. In 2007, Li met with Princeton professor Christiane Fellbaum, one of the creators of WordNet, to discuss the project. As a result of this meeting, Li went on to build ImageNet starting from the roughly 22,000 nouns of WordNet and using many of its features. She was also inspired by a 1987 estimate that the average person recognizes roughly 30,000 different kinds of objects. As an assistant professor at Princeton, Li assembled a team of researchers to work on the ImageNet project. They used Amazon Mechanical Turk to help with the classification of images. Labeling started in July 2008 and ended in April 2010. It took 49K workers from 167 countries filtering and labeling over 160M candidate images. They had enough budget to have each of the 14 million images labelled three times. The original plan called for 10,000 images per category, for 40,000 categories at 400 million images, each verified 3 times. They found that humans can classify at most 2 images/sec. At this rate, it was estimated to take 19 human-years of labor (without rest). They presented their database for the first time as a poster at the 2009 Conference on Computer Vision and Pattern Recognition (CVPR) in Florida, titled "ImageNet: A Preview of a Large-scale Hierarchical Dataset". The poster was reused at Vision Sciences Society 2009. In 2009, Alex Berg suggested adding object localization as a task. Li approached PASCAL Visual Object Classes contest in 2009 for a collaboration. It resulted in the subsequent ImageNet Large Scale Visual Recognition Challenge starting in 2010, which has 1000 classes and object localization, as compared to PASCAL VOC which had just 20 classes and 19,737 images (in 2010). === Significance for deep learning === On 30 September 2012, a convolutional neural network (CNN) called AlexNet achieved a top-5 error of 15.3% in the ImageNet 2012 Challenge, more than 10.8 percentage points lower than that of the runner-up. Using convolutional neural networks was feasible due to the use of graphics processing units (GPUs) during training, an essential ingredient of the deep learning revolution. According to The Economist, "Suddenly people started to pay attention, not just within the AI community but across the technology industry as a whole." In 2015, AlexNet was outperformed by Microsoft's very deep CNN with over 100 layers, which won the ImageNet 2015 contest, having 3.57% error on the test set. Andrej Karpathy estimated in 2014 that with concentrated effort, he could reach 5.1% error rate, and ~10 people from his lab reached ~12-13% with less effort. It was estimated that with maximal effort, a human could reach 2.4%. == Dataset == ImageNet crowdsources its annotation process. Image-level annotations indicate the presence or absence of an object class in an image, such as "there are tigers in this image" or "there are no tigers in this image". Object-level annotations provide a bounding box around the (visible part of the) indicated object. ImageNet uses a variant of the broad WordNet schema to categorize objects, augmented with 120 categories of dog breeds to showcase fine-grained classification. In 2012, ImageNet was the world's largest academic user of Mechanical Turk. The average worker identified 50 images per minute. The original plan of the full ImageNet would have roughly 50M clean, diverse and full resolution images spread over approximately 50K synsets. This was not achieved. The summary statistics given on April 30, 2010: Total number of non-empty synsets: 21841 Total number of images: 14,197,122 Number of images with bounding box annotations: 1,034,908 Number of synsets with SIFT features: 1000 Number of images with SIFT features: 1.2 million === Categories === The categories of ImageNet were filtered from the WordNet concepts. Each concept, since it can contain multiple synonyms (for example, "kitty" and "young cat"), so each concept is called a "synonym set" or "synset". There were more than 100,000 synsets in WordNet 3.0, majority of them are nouns (80,000+). The ImageNet dataset filtered these to 21,841 synsets that are countable nouns that can be visually illustrated. Each synset in WordNet 3.0 has a "WordNet ID" (wnid), which is a concatenation of part of speech and an "offset" (a unique identifying number). Every wnid starts with "n" because ImageNet only includes nouns. For example, the wnid of synset "dog, domestic dog, Canis familiaris" is "n02084071". The categories in ImageNet fall into 9 levels, from level 1 (such as "mammal") to level 9 (such as "German shepherd"). === Image format === The images were scraped from online image search (Google, Picsearch, MSN, Yahoo, Flickr, etc) using synonyms in multiple languages. For example: German shepherd, German police dog, German shepherd dog, Alsatian, ovejero alemán, pastore tedesco, 德国牧羊犬. ImageNet consists of images in RGB format with varying resolutions. For example, in ImageNet 2012, "fish" category, the resolution ranges from 4288 x 2848 to 75 x 56. In machine learning, these are typically preprocessed into a standard constant resolution, and whitened, before further processing by neural networks. For example, in PyTorch, ImageNet images are by default normalized by dividing the pixel values so that they fall between 0 and 1, then subtracting by [0.485, 0.456, 0.406], then dividing by [0.229, 0.224, 0.225]. These are the mean and standard deviations for ImageNet, so this whitens the input data. === Labels and annotations === Each image is labelled with exactly one wnid. Dense SIFT features (raw SIFT descriptors, quantized codewords, and coordinates of each descriptor/codeword) for ImageNet-1K were available for download, designed for bag of visual words. The bounding boxes of objects were available for about 3000 popular synsets with on average 150 images in each synset. Furthermore, some images have attributes. They released 25 attributes for ~400 popular synsets: Color: black, blue, brown, gray, green, orange, pink, red, violet, white, yellow Pattern: spotted, striped Shape: long, round, rectangular, square Texture: furry, smooth, rough, shiny, metallic, vegetation, wooden, wet === ImageNet-21K === The full original dataset is referred to as ImageNet-21K. ImageNet-21k contains 14,197,122 images divided into 21,841 classes. Some papers round this up and name it ImageNet-22k. The full ImageNet-21k was released in Fall of 2011, as fall11_whole.tar. There is no official train-validation-test split for ImageNet-21k. Some classes contain only 1-10 samples, while others contain thousands. === ImageNet-1K === There are various subsets of the ImageNet dataset used in various context, sometimes referred to as "versions". One of the most highly used subsets of ImageNet is the "ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012–2017 image classification and localization dataset". This is also referred to in the research literature as ImageNet-1K or ILSVRC2017, reflecting the original ILSVRC challenge that involved 1,000 classes. ImageNet-1K contains 1,281,167 training images, 50,000 validation images and 100,000 test images. Each category in ImageNet-1K is a leaf category, meaning that there are no child nodes below it, unlike ImageNet-21K. For example, in ImageNet-21K, there are some images categorized as simply "mammal", whereas in ImageNet-1K, there are only images categorized as things like "German shepherd", since there are no child-words below "German shepherd". === Later developments === In the WordNet they built ImageNet on, there were 2832 synsets in the "person" subtree. During 2018--2020 period, they removed the download of the ImageNet-21k as they went through extensive filtering in these person synsets. Out of these 2832 synsets, 1593 were deemed "potentially offensive". Out of the remaining 1239, 1081 were deemed not really "visual". The result was that only 158 syn

    Read more →
  • SciDB

    SciDB

    SciDB is a column-oriented database management system (DBMS) designed for multidimensional data management and analytics common to scientific, geospatial, financial, and industrial applications. It is developed by Paradigm4 and co-created by Michael Stonebraker. == History == Stonebraker claims that arrays are 100 times faster in SciDB than in a relational DBMS on a class of problems. It is swapping rows and columns for mathematical arrays that put fewer restrictions on the data and can work in any number of dimensions unlike the conventionally widely used relational database management system model, in which each relation supports only one dimension of records. A 2011 conference presentation on SciDB promoted it as "not Hadoop". Marilyn Matz became chief executive Paradigm4 in 2014.

    Read more →
  • Environmental informatics

    Environmental informatics

    Environmental informatics is the science of information applied to environmental science. As such, it provides the information processing and communication infrastructure to the interdisciplinary field of environmental sciences aiming at data, information and knowledge integration, the application of computational intelligence to environmental data as well as the identification of environmental impacts of information technology. Environmental informatics thus acts as a bridge, providing an interdisciplinary means of analysing, describing and understanding the complex interactions between humans, nature and technology. Since each field of applied computer science has its own subject matter, terminology and methods, specialised disciplines, such as environmental, bio- and geoinformatics have emerged, each of which combines computer science with a specific field of application such as environmental, bio- or geosciences. Environmental informatics, bioinformatics and geoinformatics all deal with computer-based processing of environmental phenomena. However, environmental informatics is the only field that pursues normative goals (e.g., political goals of environmental protection, environmental planning, and sustainability). This also influences the choice of methods. This also distinguishes it from application areas such as numerical weather prediction, which is considered an early and important example of computer simulation of environmental phenomena. The UK Natural Environment Research Council defines environmental informatics as the "research and system development focusing on the environmental sciences relating to the creation, collection, storage, processing, modelling, interpretation, display and dissemination of data and information." Kostas Karatzas defined environmental informatics as the "creation of a new 'knowledge-paradigm' towards serving environmental management needs." Karatzas argued further that environmental informatics "is an integrator of science, methods and techniques and not just the result of using information and software technology methods and tools for serving environmental engineering needs." Environmental informatics emerged in early 1990 in Central Europe. Current initiatives to effectively manage, share, and reuse environmental and ecological data are indicative of the increasing importance of fields like environmental informatics and ecoinformatics to develop the foundations for effectively managing ecological information. Examples of these initiatives are National Science Foundation Datanet projects, DataONE and Data Conservancy. == Subject matter and objectives == The subject of environmental informatics are environmental information systems (EIS). An EIS 'is a computer-based system that integrates and stores data collected about the natural environment and provides powerful methods for accessing and evaluating it.' This allows environmental data to be processed by computers for environmental protection, planning, research and technology. According to Jaeschke and Bossel, environmental informatics has three interrelated objectives: Environmental informatics serves to procure data and information for describing the state and development of the environment. Of particular importance is information that is needed to prevent or limit undesirable changes and to support desirable changes. Based on the evaluation and analysis of data, environmental informatics improves our understanding of the environment and the interactions between nature, technology and society. It thus supports environmentally relevant decisions. This enables the influence of development (system correction), the assessment of the effects and side effects of potential measures, and the creation of tools for the routine planning, implementation and monitoring of measures. == History == The simulation model World3, which formed the basis of the highly acclaimed study The Limits to Growth, is considered the starting point of environmental informatics. It incorporated environmental information, among other things, to calculate scenarios for global development. In the mid-1980s, interest grew in structuring environmental protection as an area of application for computer science. One of the first publications in German was the book Informatik im Umweltschutz. Anwendungen und Perspektiven (Computer science in environmental protection. Applications and perspectives) from 1986. The term 'environmental informatics' did not appear until around 1993, which is why the development of environmental informatics is usually referred to as having taken place in the 1990s. In 1993, the first university chair for environmental informatics was established in Cottbus. In 1994, the anthology Umweltinformatik. Informatikmethoden für Umweltschutz und Umweltforschung (Environmental Informatics: Informatics Methods for Environmental Protection and Environmental Research) was published. The development of environmental informatics was 'primarily initiated by German computer science.' In the English-speaking world, the volume Environmental Informatics was published in 1995, mainly based on the German anthology of 1994. An article in the conference proceedings of the World Computer Congress of the International Federation for Information Processing (IFIP) in Hamburg in 1994 describes the initial situation of environmental informatics as follows: 'On the one hand, we suffer from the huge amount of available data – people sometimes speak of data graveyards – on the other hand, the really relevant data may still be missing.' This statement indicates the need that led to the emergence of environmental informatics as a specialised discipline of applied computer science. Furthermore, the specific characteristics and processing requirements of environmental data necessitated the emergence of environmental informatics. The special features of environmental data include: The data structures required are highly heterogeneous due to specific processes and differing perspectives on environmental aspects (e.g., water protection, emission control, hazardous substances). In addition to the heterogeneity of the data, heterogeneous databases also play a role, as environmental data is often obtained and presented in an interdisciplinary manner. Obligations change frequently as a result of new legislation, whether regional (e.g. state regulations on water protection), national (e.g. federal emission control regulations) or international (e.g. Registration, Evaluation, Authorisation and Restriction of Chemicals|REACH). The objects represented are often multidimensional and, therefore, require complex geometric representation using curves or polygons. It is often necessary to process uncertain, imprecise or incomplete data, which is, for example, the result of extrapolations or forecasts. A new "knowledge paradigm" has emerged to meet the requirements of environmental management. Environmental informatics produces its own concepts, methods and techniques and is not merely the result of using information and communication technology methods and tools to meet environmental requirements. The development of environmental informatics since the 1990s has been significantly influenced by the newly established conferences EnviroInfo, ISESS and ITEE and is documented in the respective proceedings. Aspects of sustainability and sustainable development were increasingly integrated into environmental informatics after 2000, thereby expanding the field. In 2004, the Working Group on Sustainable Information Society of the Gesellschaft für Informatik e. V. (German Informatics Society, GI) published the Memorandum on a Sustainable Information Society, which formulates recommendations for an information society that is compatible with human, social and natural needs. Since 2007, environmental informatics has often been described in more detail as informatics for environmental protection, sustainable development and risk management. The increased focus on sustainability has also contributed to the formation of the research focus Information and Communications Technology for Sustainability (ICT4S) and to the emergence of the international conference ICT4S in 2013. ICT-ENSURE, the European Commission's funding measure for the establishment of a European research area on "ICT for Environmental Sustainability Research" (2008–2010), has also contributed to the structuring of environmental informatics. == Environmental informatics and sustainable development == Efforts to place environmental informatics within the context of sustainable development have been growing since 2000 and were significantly influenced by the Memorandum on a Sustainable Information Society. According to this Memorandum, the information society offers great but unevenly distributed opportunities for education, participation and intercultural understanding. In addition, the Memorandum highlighted the material and energy consumption of inf

    Read more →
  • Data Science Africa

    Data Science Africa

    Data Science Africa (DSA) is a non-profit knowledge sharing professional group that aims at bringing together leading researchers and practitioners working on data science methods or applications relevant to Africa, and providing training on state of the art data science methods to students and others interested in developing practical skills. Since 2013, DSA has been organizing conference, workshops and summer schools on machine learning and data science across East Africa. Facilitators of Summer School and workshops are researchers and practitioners from the academia, private and public institutions across the world. == Summer schools and workshops == The first summer school which started as Gaussian Process Summer School was held at Makerere University in Kampala, Uganda from 6th to 9 August 2013. The First Data Science Summer School and Workshop was held at Dedan Kimathi University of Technology in Nyeri, Kenya from 15th to 19 June 2015. The Second Data Science Summer School was held at Makerere University, Kampala, Uganda from 27th to 29 July 2016, and the workshop was held at Pulse Lab, Kampala, Uganda from 30 July to 1 August 2016. The Third Data Science Summer School and Workshop was held at Nelson Mandela African Institute of Science and Technology, Tanzania from 19th to 21 July 2017. Among the sponsors of the event was ARM

    Read more →
  • Client-side persistent data

    Client-side persistent data

    Client-side persistent data or CSPD is a term used in computing for storing data required by web applications to complete internet tasks on the client-side as needed rather than exclusively on the server. As a framework it is one solution to the needs of Occasionally connected computing or OCC. A major challenge for HTTP as a stateless protocol has been asynchronous tasks. The AJAX pattern using XMLHttpRequest was first introduced by Microsoft in the context of the Outlook e-mail product. The first CSPD were the 'cookies' introduced by the Netscape Navigator. ActiveX components which have entries in the Windows registry can also be viewed as a form of client-side persistence.

    Read more →
  • Mobile content management system

    Mobile content management system

    A mobile content management system (MCMs) is a type of content management system (CMS) capable of storing and delivering content and services to mobile devices, such as mobile phones, smart phones, and PDAs. Mobile content management systems may be discrete systems, or may exist as features, modules or add-ons of larger content management systems capable of multi-channel content delivery. Mobile content delivery has unique, specific constraints including widely variable device capacities, small screen size, limitations on wireless bandwidth, sometimes small storage capacity, and (for some devices) comparatively weak device processors. Demand for mobile content management increased as mobile devices became increasingly ubiquitous and sophisticated. MCMS technology initially focused on the business to consumer (B2C) mobile market place with ringtones, games, text-messaging, news, and other related content. Since, mobile content management systems have also taken root in business-to-business (B2B) and business-to-employee (B2E) situations, allowing companies to provide more timely information and functionality to business partners and mobile workforces in an increasingly efficient manner. A 2008 estimate put global revenue for mobile content management at US$8 billion. == Key features == === Multi-channel content delivery === Multi-channel content delivery capabilities allow users not to manage a central content repository while simultaneously delivering that content to mobile devices such as mobile phones, smartphones, tablets and other mobile devices. Content can be stored in a raw format (such as Microsoft Word, Excel, PowerPoint, PDF, Text, HTML etc.) to which device-specific presentation styles can be applied. === Content access control === Access control includes authorization, authentication, access approval to each content. In many cases the access control also includes download control, wipe-out for specific user, time specific access. For the authentication, MCM shall have basic authentication which has user ID and password. For higher security many MCM supports IP authentication and mobile device authentication. === Specialized templating system === While traditional web content management systems handle templates for only a handful of web browsers, mobile CMS templates must be adapted to the very wide range of target devices with different capacities and limitations. There are two approaches to adapting templates: multi-client and multi-site. The multi-client approach makes it possible to see all versions of a site at the same domain (e.g. sitename.com), and templates are presented based on the device client used for viewing. The multi-site approach displays the mobile site on a targeted sub-domain (e.g. mobile.sitename.com). === Location-based content delivery === Location-based content delivery provides targeted content, such as information, advertisements, maps, directions, and news, to mobile devices based on current physical location. Currently, GPS (global positioning system) navigation systems offer the most popular location-based services. Navigation systems are specialized systems, but incorporating mobile phone functionality makes greater exploitation of location-aware content delivery possible.

    Read more →
  • Causal AI

    Causal AI

    Causal AI is a technique in artificial intelligence that builds a causal model and can thereby make inferences using causality rather than just correlation. One practical use for causal AI is for organisations to explain decision-making and the causes for a decision. Systems based on causal AI, by identifying the underlying web of causality for a behaviour or event, provide insights that solely predictive AI models might fail to extract from historical data. An analysis of causality may be used to supplement human decisions in situations where understanding the causes behind an outcome is necessary, such as quantifying the impact of different interventions, policy decisions or performing scenario planning. A 2024 paper from Google DeepMind demonstrated mathematically that "Any agent capable of adapting to a sufficiently large set of distributional shifts must have learned a causal model". The paper offers the interpretation that learning to generalise beyond the original training set requires learning a causal model, concluding that causal AI is necessary for artificial general intelligence. == History == The concept of causal AI and the limits of machine learning were raised by Judea Pearl, the Turing Award-winning computer scientist and philosopher, in 2018's The Book of Why: The New Science of Cause and Effect. Pearl asserted: “Machines' lack of understanding of causal relations is perhaps the biggest roadblock to giving them human-level intelligence.” In 2020, Columbia University established a Causal AI Lab under Director Elias Bareinboim. Professor Bareinboim's research focuses on causal and counterfactual inference and their applications to data-driven fields in the health and social sciences as well as artificial intelligence and machine learning. Technological research and consulting firm Gartner for the first time included causal AI in its 2022 Hype Cycle report, citing it as one of five critical technologies in accelerated AI automation. Causal AI is closely related to but distinct from fields such as causal inference, explainable AI and causal reasoning. While causal inference focuses on estimating cause-effect relationships (often from observational data), causal AI emphasises the integration of those causal models into AI systems for prediction, planning and adaptation.

    Read more →
  • Informetrics

    Informetrics

    Informetrics is the study of quantitative aspects of information, it is an extension and evolution of traditional bibliometrics and scientometrics. Informetrics uses bibliometrics and scientometrics methods to study mainly the problems of literature information management and evaluation of science and technology. Informetrics is an independent discipline that uses quantitative methods from mathematics and statistics to study the process, phenomena, and law of informetrics. Informetrics has gained more attention as it is a common scientific method for academic evaluation, research hotspots in discipline, and trend analysis. Informetrics includes the production, dissemination, and use of all forms of information, regardless of its form or origin. Informetrics encompasses the following fields: Scientometrics, which studies quantitative aspects of science Webometrics, which studies quantitative aspects of the World Wide Web Bibliometrics, which studies quantitative aspects of recorded information Cybermetrics, which is similar to webometrics, but broadens its definition to include electronic resources == Origin and Development == The term informetrics (French: informétrie) was coined by German scholar Otto Nacke in 1979, and came from the German word 'informetrie’. The corresponding English terminology soon appeared in the subsequent literature. In September 1980, Professor Otto Nacke introduced the term 'informetrics' at the first seminar on Informetrics in Frankfurt, Germany. Later, Committee on Informetrics has established through The International Federation for Information and Documentation (FID). In 1987, informetrics started to be officially recognized by the international information community and several foreign information scientists. In 1988, at First International Conference on Bibliometrics and Theoretical Aspects of Information Retrieval Archived 2022-05-23 at the Wayback Machine, Brooks suggested bibliometrics and scientometrics can be included in the field of informetrics. In 1990, Leo Egghe and Ronald Rousseau proposed the formation of the discipline of informetrics: statistical bibliography (1923) to bibliometrics and scientometrics (1969) and then to informetrics (1979). In 1993, the International Society for Scientometrics and Informetrics (ISSI) Archived 2023-11-05 at the Wayback Machine was founded at the International Conference on Bibliometrics, Informetrics and Scientometrics in Berlin, and the first one was held in Belgium and organized by Leo Egghe and Ronald Rousseau. The society was formally incorporated in 1994 in the Netherlands and plays a significant role in the development of informetrics. The ISSI aims to promote the "exchange and communication of professional information in the fields of scientometrics and informetrics, including improve standards, theory and practice, as well as promote research, education and training". In addition, to "engage in relevant public conversation and policy discussions". In the western world, 20th century's Informetrics is mostly based on Lotka's law, named after Alfred J. Lotka, Zipf's law, named after George Kingsley Zipf, Bradford's law named after Samuel C. Bradford and on the work of Derek J. de Solla Price, Gerard Salton, Leo Egghe, Ronald Rousseau, Tibor Braun, Olle Persson, Peter Ingwersen, Manfred Bonitz, and Eugene Garfield. == Difference Between Informetrics, Bibliometrics and Scientometrics == Since the 1960s, three similar terms have emerged in the fields of library science, philology and science of science, they are bibliometrics, scientometrics and informetrics, representing three very similar quantitative sub-disciplines. The three metrics terms can be confusing and often misused. Informetrics and bibliometrics interpenetrate each other but have different aspects in research object, research scope, and measuring unit. Informetrics and scientometrics are very different in their research purpose and research object, as well as the research scope and application. Bibliometrics is categorised under the field of library science, it uses mathematical and statistical methods to describe, evaluate, and predict the current status and trends of science and technology. Also to study the "distribution structure, quantitative relationship, change law and quantitative management of literature information, quantitative relationships, patterns and quantitative management of literature and information". The term was first used by Alan Pritchard in 1969 in his paper Statistical Bibliography or Bibliometrics?. Scientometrics is a branch of science that quantitatively evaluates and predicts the process and management of scientific activities in order to reveal their development patterns and trends. The definition of scientometrics was described by Derek De Solla Price in his book Science to Science as the “quantitative study of science, communication in science, and science policy”. === Links between the three metrics terms === The most prominent connection between the three metrics terms is in their research objects. Since all three disciplines use literature information as their research object, therefore, they have some similarities and overlaps in their research methods and fields. Moreover, they all use mathematical methods as the basic research methods and they all apply the three basic laws, Bradford's law, Lotka's law and Zipf's law. === Distinctions between the three metrics terms === The distinction between the three metrics terms can tell from their research object and research purpose. The research of bibliometrics focuses on the analysis of "scientific output in the form of articles, publications, citations, and others". Scientometrics is to measure the basic characteristics and laws of scientific activities. Where informetrics is to investigate information sources and information distribution process. == Concept and System Structure == === Purpose of Informetrics Research === The main purpose of informetrics is to use its theocratical research to solve the methodological issues in the research process, and to discover and reveal the basic laws of information distribution through the study of information process and phenomenon. In this way, makes information management more scientific and provides a quantitative basis for information services and information management decisions. For informetrics, it is necessary to bring quantitative analysis methods to further reveal the structure of information units and the "quantitative change law of literature information”. Further to this, to improve the scientific accuracy of information science from a theoretical point of view. At the same time, to better solve the basic contradictions in the information service, overcome the information crisis, and make the information management work more effective to serve science and technology, economic and social development. Quantitative analysis of bibliographic data was pioneered by Robert K. Merton in an article called Science, Technology, and Society in Seventeenth Century England and originally published by Merton in 1938. === The Significance of Informetrics Research === The significance of informetrics research is to summarize various empirical laws from the theoretical point of view, at the same time test and modify the various empirical laws in the new information unit conditions, and explore its new applicability, therefore, the scientific nature of information science can be improved, but also to provide theoretical guidance for practical work. === The Objects of Informetrics Research === The object of informetrics is broader than the field of bibliometrics and scientometrics, including "messages, data, events, objects, text, and documents”. Informetrics is often used to inform policies and decisions across a broad range of fields, such as economy, politics, technology and social spheres that "influence the flow and use patterns of information". Tague-Sutcliffe describes the following uses of informetrics: Citation analysis; Characteristics of authors; Use of recorded information; Obsolescence of the literature; Concomitant growth of new concepts; Characteristics of publication sources; Definition and measurement o information; Growth of subject literature, databases, libraries; Types and characteristics of retrieval performance measures; Statistical aspects of language, word, and phrase frequencies. == Basic Laws == In the field of informetrics research, there are many outstanding contributors in the discipline with a solid knowledge of quantitative research methods. In the early 20th century, several scientists contributed empirical applications that have become the three basic laws of informetrics, Bradford's law, Lotka's law, and Zipf's law, which promote the development of informetrics. === Bradford's Law === The British documentalist and librarian Samuel C. Bradford first discovered the law of concentration and scattering of literature, and in 1934, it has be

    Read more →
  • Word error rate

    Word error rate

    Word error rate (WER) is a common metric of the performance of a speech recognition or machine translation system. The WER metric typically ranges from 0 to 1, where 0 indicates that the compared pieces of text are exactly identical, and 1 (or larger) indicates that they are completely different with no similarity. This way, a WER of 0.8 means that there is an 80% error rate for compared sentences. The general difficulty of measuring performance lies in the fact that the recognized word sequence can have a different length from the reference word sequence (supposedly the correct one). The WER is derived from the Levenshtein distance, working at the word level instead of the phoneme level. The WER is a valuable tool for comparing different systems as well as for evaluating improvements within one system. This kind of measurement, however, provides no details on the nature of translation errors and further work is therefore required to identify the main source(s) of error and to focus any research effort. This problem is solved by first aligning the recognized word sequence with the reference (spoken) word sequence using dynamic string alignment. Examination of this issue is seen through a theory called the power law that states the correlation between perplexity and word error rate. Word error rate can then be computed as: W E R = S + D + I N = S + D + I S + D + C {\displaystyle {\mathit {WER}}={\frac {S+D+I}{N}}={\frac {S+D+I}{S+D+C}}} where S is the number of substitutions, D is the number of deletions, I is the number of insertions, C is the number of correct words, N is the number of words in the reference (N=S+D+C) The intuition behind 'deletion' and 'insertion' is how to get from the reference to the hypothesis. So if we have the reference "This is wikipedia" and hypothesis "This _ wikipedia", we call it a deletion. Note that since N is the number of words in the reference, the word error rate can be larger than 1.0, namely if the number of insertions I is larger than the number of correct words C. When reporting the performance of a speech recognition system, sometimes word accuracy (WAcc) is used instead: W A c c = 1 − W E R = N − S − D − I N = C − I N {\displaystyle {\mathit {WAcc}}=1-{\mathit {WER}}={\frac {N-S-D-I}{N}}={\frac {C-I}{N}}} Since the WER can be larger than 1.0, the word accuracy can be smaller than 0.0. == Experiments == It is commonly believed that a lower word error rate shows superior accuracy in recognition of speech, compared with a higher word error rate. However, at least one study has shown that this may not be true. In a Microsoft Research experiment, it was shown that, if people were trained under "that matches the optimization objective for understanding", (Wang, Acero and Chelba, 2003) they would show a higher accuracy in understanding of language than other people who demonstrated a lower word error rate, showing that true understanding of spoken language relies on more than just high word recognition accuracy. == Other metrics == One problem with using a generic formula such as the one above, however, is that no account is taken of the effect that different types of error may have on the likelihood of successful outcome, e.g. some errors may be more disruptive than others and some may be corrected more easily than others. These factors are likely to be specific to the syntax being tested. A further problem is that, even with the best alignment, the formula cannot distinguish a substitution error from a combined deletion plus insertion error. Hunt (1990) has proposed the use of a weighted measure of performance accuracy where errors of substitution are weighted at unity but errors of deletion and insertion are both weighted only at 0.5, thus: W E R = S + 0.5 D + 0.5 I N {\displaystyle {\mathit {WER}}={\frac {S+0.5D+0.5I}{N}}} There is some debate, however, as to whether Hunt's formula may properly be used to assess the performance of a single system, as it was developed as a means of comparing more fairly competing candidate systems. A further complication is added by whether a given syntax allows for error correction and, if it does, how easy that process is for the user. There is thus some merit to the argument that performance metrics should be developed to suit the particular system being measured. Whichever metric is used, however, one major theoretical problem in assessing the performance of a system is deciding whether a word has been “mis-pronounced,” i.e. does the fault lie with the user or with the recogniser. This may be particularly relevant in a system which is designed to cope with non-native speakers of a given language or with strong regional accents. The pace at which words should be spoken during the measurement process is also a source of variability between subjects, as is the need for subjects to rest or take a breath. All such factors may need to be controlled in some way. For text dictation it is generally agreed that performance accuracy at a rate below 95% is not acceptable, but this again may be syntax and/or domain specific, e.g. whether there is time pressure on users to complete the task, whether there are alternative methods of completion, and so on. The term "Single Word Error Rate" is sometimes referred to as the percentage of incorrect recognitions for each different word in the system vocabulary. == Edit distance == The word error rate may also be referred to as the length normalized edit distance. The normalized edit distance between X and Y, d( X, Y ) is defined as the minimum of W( P ) / L ( P ), where P is an editing path between X and Y, W ( P ) is the sum of the weights of the elementary edit operations of P, and L(P) is the number of these operations (length of P).

    Read more →
  • HAKMEM

    HAKMEM

    HAKMEM, alternatively known as AI Memo 239, is a February 1972 "memo" (technical report) of the MIT AI Lab containing a wide variety of hacks, including useful and clever algorithms for mathematical computation, some number theory and schematic diagrams for hardware – in Guy L. Steele's words, "a bizarre and eclectic potpourri of technical trivia". Contributors included about two dozen members and associates of the AI Lab. The title of the report is short for "hacks memo", abbreviated to six upper case characters that would fit in a single PDP-10 machine word (using a six-bit character set). == History == HAKMEM is notable as an early compendium of algorithmic technique, particularly for its practical bent, and as an illustration of the wide-ranging interests of AI Lab people of the time, which included almost anything other than AI research. HAKMEM contains original work in some fields, notably continued fractions. == Introduction == Compiled with the hope that a record of the random things people do around here can save some duplication of effort -- except for fun. Here is some little known data which may be of interest to computer hackers. The items and examples are so sketchy that to decipher them may require more sincerity and curiosity than a non-hacker can muster. Doubtless, little of this is new, but nowadays it's hard to tell. So we must be content to give you an insight, or save you some cycles, and to welcome further contributions of items, new or used.

    Read more →
  • Five safes

    Five safes

    The Five Safes is a framework for helping make decisions about making effective use of data which is confidential or sensitive. It is mainly used to describe or design research access to statistical data held by government and health agencies, and by data archives such as the UK Data Service. It is not an internationally accepted standard. Two of the Five Safes refer to statistical disclosure control, and so the Five Safes is usually used to contrast statistical and non-statistical controls when comparing data management options. == Concept == The Five Safes proposes that data management decisions be considered as solving problems in five 'dimensions': projects, people, settings, data and outputs. The combination of the controls leads to 'safe use'. These are most commonly expressed as questions, for example: These dimensions are scales, not limits. That is, solutions can have a mix of more or fewer controls in each dimension, but the overall aim of 'safe use' independent of the particular mix. For example, a public use file available for open download cannot control who uses it, where or for what purpose, and so all the control (protection) must be in the data itself. In contrast, a file which is only accessed through a secure environment with certified users can contain very sensitive information: the non-statistical controls allow the data to be 'unsafe'. One academic likened the process to a graphic equalizer, where bass and treble can be combined independently to produce a sound the listener likes, which has proven to be a very useful metaphor. This 2023 Data Foundation webinar is an expert discussion of how the elements interact, including an excellent introductory representation. There is no 'order' to the Five Safes, in that one is necessarily more important than the others. However, Ritchie argued that the 'managerial' controls (projects, people, setting) should be addressed before the 'statistical' controls (data, output). The Five Safes concept is associated with other topics which developed from the same programme at ONS, although these are not necessarily implemented. Safe people is associated with 'active researcher management', while safe outputs is linked with principles-based output statistical disclosure control. The Five Safes is a positive framework, describing what is and is not. The EDRU ('evidence-based, default-open, risk-managed, user-centred') attitudinal model is sometimes used to give a normative context == The 'data access spectrum' == From 2003 the Five Safes was also represented in a simpler form as a 'Data Access Spectrum'. The non-data controls (project, people, setting, outputs) tend to work together, in that organisations often see these as a complementary set of restrictions on access. These can then be contrasted with choices about data anonymisation to present a linear representation of data access options. This presentation is consistent with the idea of 'data as a residual', as well as data protection laws of the time which often characterised data simply as anonymous or not anonymous. A similar idea had already been developed independently in 2001 by Chuck Humphrey of the Canadian RDC network, the 'continuum of access'. More recently, The Open Data Institute has developed a 'Data Spectrum toolkit' which includes industry-specific examples. == History and terminology == The Five Safes was devised in the winter of 2002/2003 by Felix Ritchie at the UK Office for National Statistics (ONS) to describe its secure remote-access Virtual Microdata Laboratory (VML). It was described at this time as the 'VML Security Model'. This was adopted by the NORC data enclave, and more widely in the US, as the 'portfolio model' (although this is now also used to refer to a slightly different legal/statistical/educational breakdown). In 2012 the framework as was still being referred to as the 'VML security model', but its increasing use among non-UK organisations led to the adoption of the more general and informative phrase 'Five Safes'. The original framework only had four safes (projects, people, settings and outputs): the framework was used to describe highly detailed data access through a secure environment, and so the 'data' dimension was irrelevant. From 2007 onwards, 'safe data' was included as the framework was used to a describe a wider range of ONS activities. As the US version was based upon the 2005 specification, some US iterations uses have the original four dimensions (eg). Some discussions, such as the OECD, use the term 'secure' instead 'safe'. However, the use of both these terms can cause presentational problems: less control in a particular dimension could be seen to imply 'unsafe users' or 'insecure settings', for example, which distracts from the main message. Hence, the Australian government uses the term "five data sharing principles". The 'Anonymisation Decision-Making Framework' uses a framework based on the Five Safes but relabelling "projects", "people", and "settings" as "governance", "agency" and "infrastructure", respectively; "Output" is omitted, and "safe use" becomes "functional anonymisation". There is no reference to the Five Safes or any associated literature. The Australian version was required to include references to the Five Safes, and presented it as an alternative without comment. == Application == The framework has had three uses: pedagogical, descriptive, and design. Since 2016, it has also been used, directly and indirectly in legislation. See for more detailed examples. === Pedagogy === The first significant use of the framework, other than internal administrative use, was to structure researcher training courses at the UK Office for National Statistics from 2003. UK Data Archive, Administrative Data Research Network, Eurostat, Statistics New Zealand, the Mexican National Institute of Statistics and Geography, NORC, Statistics Canada and the Australian Bureau of Statistics, amongst others, have also used this framework. Most of these courses are for researchers using restricted-access facilities; the Eurostat courses are unusual in that they are designed for all users of sensitive data. === Description === The framework is often used to describe existing data access solutions (e.g. UK HMRC Data Lab, UK Data Service, Statistics New Zealand) or planned/conceptualised ones (e.g. Eurostat in 2011). An early use was to help identify areas where ONS' still had 'irreducible risks' in its provision of secure remote access. The framework is mostly used for confidential social science data. To date it appears to have made little impact on medical research planning, although it is now included in the revised guidelines on implementing HIPAA regulations in the US, and by Cancer Research UK and the Health Foundation in the UK. It has also been used to describe a security model for the Scottish Health Informatics Programme. === Design === In general the Five Safes has been used to describe solutions post-factum, and to explain/justify choices made, but an increasing number of organisations have used the framework to design data access solutions. For example, the Hellenic Statistical Agency developed a data strategy built around the Five Safes in 2016; the UK Health Foundation used the Five Safes to design its data management and training programmes. Use in the private sector is less common but some organisations have incorporated the Five Safes into consulting services. In 2015 the UK Data Service organized a workshop to encourage data users from the academic and private sectors to think about how to manage confidential research data, using the Five Safes to demonstrate alternative options and best practice. Early adopters for strategic design use were in Australia: both the Australian Bureau of Statistics and the Australian Department of Social Service used the Five Safes as an ex ante design tool. In 2017 the Australian Productivity Commission recommended adopting a version of the framework to support cross-government data sharing and re-use. This underwent extensive consultation and culminated in the DAT Act 2022. Since 2020 the Five Safes has been the overriding framework for the design of new secure facilities and data sharing arrangements in the UK for public health and social sciences. This has been promoted by the Office for Statistics Regulation, the UK Statistics Authority, NHS DIgital, and the research funding bodies Administrative Data Research UK and DARE UK. === Regulation and legislation === Three laws have incorporated the Fives Safes. They are explicit in the South Australian Public Sector (Data Sharing) Act 2016, and implicit in the research provisions of the UK Digital Economy Act 2017. The Australian Data Availability and Transparency Act 2022 renames the Five Safes as the Five Data Sharing Principles.A 2025 statutory review of the DAT Act 2022 found "that the DAT Act has not been effective in achieving its objectives.". The review includes specific referen

    Read more →
  • ARMA International

    ARMA International

    ARMA International (formerly the Association of Records Managers and Administrators) is an American not-for-profit professional association for information professionals – primarily information management (including records management) and information governance, and related industry practitioners and vendors. The association provides educational opportunities and publications covering aspects of information management broadly. == History == The Association was founded in 1955. In 1975, the Association of Records Executives and Administrators (AREA) and the American Records Management Association merged to form ARMA International. The headquarters for ARMA International is located in Overland Park, Kansas. == Operations == ARMA International services professionals in the United States, Canada, Japan, and the United Kingdom. Its members include records managers, attorneys, information technology professionals, consultants, and archivists involved in various aspects of managing records and information assets. ARMA hosts an annual conference with the goal of bringing together record and information management professionals from around the world – In 2023, ARMA hosted conferences in both the United States and Canada. Topics addressed in the 120+ educational sessions include advanced technology, creating information structure, ediscovery and information law, information management fundamentals, information project management, and reducing organizational information risk. The expo features exhibitors displaying records and information technologies, products, and services.

    Read more →
  • DryvIQ

    DryvIQ

    DryvIQ is a software application that enables businesses to migrate on-site system files and associated data across storage and content management platforms, as well as create synchronized hybrid storage systems. == History == Before it was DryvIQ, the software SkySync was released in 2013 by Ann Arbor, Michigan based company, Portal Architects, Inc. The company created SkySync, a back-end, administrative application designed to transfer content across storage platforms, after abandoning 18 months of development on a desktop application called SkyBrary in 2011. Between 2014 and 2015, Portal Architects established partnerships with the following companies: Autodesk, Box, Dropbox, Egnyte, EMC, Google, Syncplicity, Huddle, IBM, Microsoft, OpenText, Oracle, Citrix ShareFile, Hightail and Internet2. SkySync (currently DryvIQ) was named a "Cool Vendor in Content Management" by Gartner in 2015. In 2022, SkySync changed its name to DryvIQ, which is now what the company is currently known as. == Overview == DryvIQ is a software application that syncs, migrates or backs up files including their associated properties, metadata, versions, user accounts and permissions across on-premises and Cloud-based storage platforms. The software deploys on a server, virtual machine or within Microsoft Azure, Amazon Web Services or other cloud computing services.

    Read more →
  • User-subjective approach

    User-subjective approach

    The user-subjective approach is the first interaction design approach dedicated specifically to personal information management (PIM). The approach offers design principles with which PIM systems (e.g. operating systems, email applications and web browsers) can make systematic use of subjective (i.e. user-dependent) attributes. The approach evolved in three stages: (a) theoretical foundations first published in a Journal of the American Society for Information Science and Technology during 2003. The paper introduces the approach and its design principles (b) evidence and implementation was published in another JASIST paper in 2008. The paper gives empirical evidence in support of the approach as well as seven novel design schemes that derives from it. It has won the Best JASIST paper award in 2009.(c) specific design evaluation this stage has already begun with evaluation of the first user-subjective design prototype called GrayArea in a Conference on Human Factors in Computing Systems paper published in 2009. == Theoretical foundations == The user-subjective approach takes advantage of the fact that in PIM the person who retrieves the information is the same person who had previously stored it. PIM can be seen as a communication between the person and him\her self at two different times: the time of storage and the time of retrieval. The PIM system design should help facilitate that unique communication by allowing the user use subjective (user-dependent) attributes in addition to the standard objective ones. PIM systems should capture these subjective attributes when the user interacts with the information item (either automatically or by using direct manipulation interface) in order to help the user retrieve the item later on. The user-subjective approach identifies three subjective attributes – the project which the item was classified to, its degree of importance to the user, and the context in which the item was used during the interaction with it. The approach also assigns a design principle for each. The principles (discussed below) are deliberately abstract to allow for a variety of different implementations. === The subjective project classification principle === The subjective project classification principle suggests that PIM systems design should allow all information items related to a project be classified under the same category regardless of whether they are files, emails, Web Favorites or of any other format. This stands in sharp contrast with the present PIM system design where there are distinct folder hierarchies for each of these formats. The current design forces the user to store information related to a single project in separate locations depending on their format causing the project fragmentation problem. === The subjective importance principle === The subjective importance principle suggests that the subjective importance of information should affect its degree of visual salience and accessibility: important information items should be highly visible and accessible as they are more likely to be retrieved (the promotion principle) and those of lower importance should be demoted (i.e. making them less visible) so as not to distract the user (the demotion principle). While the promotion principle is not new and has been widely applied in PIM system design, the demotion principle is novel and has been applied only sporadically in these systems. Currently these systems allow only two options: keeping information (where unneeded information items could clutter folders and obscure the target item) and deleting it (where there is a risk that the item will not be there when needed). Demotion suggests a third option where the item is less visible so it doesn’t distract the user but is kept within its original context in case the user would need it after all. === The subjective context principle === The subjective context principle suggests that PIM systems should allow users retrieve their information items in the same context that they had previously used in order to bridge the time gap between these two events. By "context" the approach refers to other information items that were used at the time of interaction with the item, thoughts that the users may have regarding the item, the phase the user got to in the interaction with the item and other people the user collaborates with regarding the information item. == Evidence and implementations == === Evidence === The user-subjective approach was evaluated in a multioperational designed study which used questionnaires, screen shots and in-depth interviews (N = 84). The research tested the use of subjective attributes in current PIM systems and its dependency on design. Results show that participants used subjective attributes whenever design allowed them to. When it didn't, they either used their own alternative ways to use these attributes or avoided using subjective attributes at all. Regarding the subjective project classification principle – many of the participants' recent files, emails and web pages related to the same projects (indicating that they were working on the same project using different formats), and they had saved files of different format in the same project folders. However, as design does not suggest storing emails and web favorites with files, users avoid doing so. Regarding the subjective importance principle – users tended to retrieve their important information from highly visible and accessible locations offered by current design (e.g. by using the desktop), however since current systems offers no way to demote files of low subjective importance participants tended to use their own walk around ways for doing so (e.g. by moving them to a folder called "old" inside their original folder). Regarding the subjective context principle – participants tended to talk spontaneously about the context of their information items during the interview. These evidence imply that current PIM systems could possibly be improved if it would allow users to make more use of subjective attributes of their personal information. === Implementations === Each of the user-subjective design principles can be implemented in various ways. Moreover, as the approach is generative it offers PIM designers to use these principles in order to create their own user subjective designs. Below are design schemes that demonstrate an implementation of each of the principles. A more complete set of implementation examples can be found in the user-subjective website Archived 2011-02-01 at the Wayback Machine. The single hierarchy solution – addresses the project fragmentation problem (the current situation where the users stores and retrieve their project-related files, emails and web favorites at different hierarchies) and implements the subjective classification principle by offering the user a single folder hierarchy for all information items. At the operation system level the users would navigate to a folder and find there all project related files, emails, web favorites, tasks, contacts and notes. This would allow them to retrieve all their project-related information items from a single location regardless of their formats. When looking at these folders at their mail box the users would see only their emails and only web favorites through their browser. The single hierarchy design scheme has not been evaluated yet. GrayArea – implements the demotion principle by allowing users to move subjectively unimportant files to a gray area at the bottom end of their folders. This clears the upper part of the folder from file that are unlikely to be retrieved while allowing the users to retrieve these unimportant file in their original context in case they are needed after all. GrayArea design scheme was positively evaluated (see next section). ItemHistory – is an implementation of the subjective context principle. It allows users to reach all information items that were previously retrieved while that information item was open. This design scheme has not been evaluated to date. == Specific design evaluation == The evaluation of specific designs is the third and final step of the approach development. It had begun with the assessment of GrayArea. === GrayArea evaluation === GrayArea was evaluated by using a prototype that simulated the participants' folders but included a gray area where they could drag & drop their subjectively unimportant files. In the study 96 participants were asked to clean up their folders from unimportant files once with GrayArea and once without it. Results show that the use of GrayArea reduced the clutter in folders, that it was easier for participants to demote files than to delete them and that they would use it if provided in their next operating system. These results encourage commercial implementation of GrayArea and the development and testing of other user-subjective designs. == Chronological development == The user-subjective approach was developed by

    Read more →
  • Operational data store

    Operational data store

    An operational data store (ODS) is used for operational reporting and as a source of data for the enterprise data warehouse (EDW). It is a complementary element to an EDW in a decision support environment, and is used for operational reporting, controls, and decision making, as opposed to the EDW, which is used for tactical and strategic decision support. An ODS is a database designed to integrate data from multiple sources for additional operations on the data, for reporting, controls and operational decision support. Unlike a production master data store, the data is not passed back to operational systems. It may be passed for further operations and to the data warehouse for reporting. An ODS should not be confused with an enterprise data hub (EDH). An operational data store will take transactional data from one or more production systems and loosely integrate it, in some respects it is still subject oriented, integrated and time variant, but without the volatility constraints. This integration is mainly achieved through the use of EDW structures and content. An ODS is not an intrinsic part of an EDH solution, although an EDH may be used to subsume some of the processing performed by an ODS and the EDW. An EDH is a broker of data. An ODS is certainly not. Because the data originates from multiple sources, the integration often involves cleaning, resolving redundancy and checking against business rules for integrity. An ODS is usually designed to contain low-level or atomic (indivisible) data (such as transactions and prices) with limited history that is captured "real time" or "near real time" as opposed to the much greater volumes of data stored in the data warehouse generally on a less-frequent basis. == General use == The general purpose of an ODS is to integrate data from disparate source systems in a single structure, using data integration technologies like data virtualization, data federation, or extract, transform, and load (ETL). This will allow operational access to the data for operational reporting, master data or reference data management. An ODS is not a replacement or substitute for a data warehouse or for a data hub but in turn could become a source.

    Read more →