AI Detector App

AI Detector App — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Cloud robotics

    Cloud robotics is a field of robotics that attempts to invoke cloud technologies such as cloud computing, cloud storage, and other Internet technologies centered on the benefits of converged infrastructure and shared services for robotics. When connected to the cloud, robots can benefit from the powerful computation, storage, and communication resources of a modern data center, which can process and share information from various robots or agents (other machines, smart objects, humans, etc.). Humans can also delegate tasks to robots remotely through networks. Cloud computing technologies enable robot systems to gain capability while reducing costs. Thus, it is possible to build lightweight, low-cost, smarter robots with an intelligent "brain" in the cloud. The "brain" consists of a data center, knowledge base, task planners, deep learning, information processing, environment models, communication support, etc.

    == Components ==

    A cloud for robots potentially has at least six significant components:

      • Building a "cloud brain" for robots, the main object of cloud robotics;
      • Offering a global library of images, maps, and object data, often with geometry and mechanical properties, expert systems, and knowledge bases (i.e. semantic web, data centres);
      • Massively parallel computation on demand for sample-based statistical modelling and motion planning, task planning, multi-robot collaboration, and scheduling and coordination of the system;
      • Robot sharing of outcomes, trajectories, and dynamic control policies, and robot learning support;
      • Human sharing of open-source code, data, and designs for programming, experimentation, and hardware construction;
      • On-demand human guidance and assistance for evaluation, learning, and error recovery;
      • Augmented human–robot interaction through various means (semantic knowledge bases, Apple Siri-like services, etc.).

    == Applications ==

    Autonomous mobile robots: Google's self-driving cars are cloud robots.
    The cars use the network to access Google's enormous database of maps, satellite imagery, and environment models (such as Street View) and combine it with streaming data from GPS, cameras, and 3D sensors to monitor their own position to within centimetres, and with past and current traffic patterns to avoid collisions. Each car can learn something about environments, roads, driving, or conditions, and sends that information to the Google cloud, where it can be used to improve the performance of other cars.

    Cloud medical robots: A medical cloud (also called a healthcare cluster) consists of various services such as a disease archive, electronic medical records, a patient health management system, practice services, analytics services, clinic solutions, expert systems, etc. A robot can connect to the cloud to provide clinical service to patients, as well as assist doctors (e.g. a co-surgery robot). Moreover, it provides a collaboration service by sharing information about clinical treatment between doctors and caregivers.

    Assistive robots: A domestic robot can be employed for healthcare and life monitoring for elderly people. The system collects the health status of users and exchanges information with a cloud expert system or with doctors to make life easier for elderly people, especially those with chronic diseases. For example, the robots can provide support to prevent the elderly from falling, and emergency health support for conditions such as heart disease and blood disease. Caregivers of elderly people can also be notified by the robot over the network in an emergency.

    Industrial robots: As highlighted by the German government's Industry 4.0 Plan, "Industry is on the threshold of the fourth industrial revolution. Driven by the Internet, the real and virtual worlds are growing closer and closer together to form the Internet of Things.
    Industrial production of the future will be characterised by the strong individualisation of products under the conditions of highly flexible (large series) production, the extensive integration of customers and business partners in business and value-added processes, and the linking of production and high-quality services leading to so-called hybrid products." In manufacturing, such cloud-based robot systems could learn to handle tasks such as threading wires or cables, or aligning gaskets, from a professional knowledge base. A group of robots can share information for collaborative tasks. Furthermore, a consumer can place customised product orders with manufacturing robots directly through online ordering systems. Another potential paradigm is shopping-delivery robot systems: once an order is placed, a warehouse robot dispatches the item to an autonomous car or autonomous drone, which delivers it to the recipient.

    == Research ==

    RoboEarth was funded by the European Union's Seventh Framework Programme for research and technological development projects, specifically to explore the field of cloud robotics. The goal of RoboEarth is to allow robotic systems to benefit from the experience of other robots, paving the way for rapid advances in machine cognition and behaviour, and ultimately, for more subtle and sophisticated human-machine interaction. RoboEarth offers a cloud robotics infrastructure. RoboEarth's World-Wide-Web-style database stores knowledge generated by humans and by robots in a machine-readable format. Data stored in the RoboEarth knowledge base include software components, maps for navigation (e.g., object locations, world models), task knowledge (e.g., action recipes, manipulation strategies), and object recognition models (e.g., images, object models). The RoboEarth Cloud Engine includes support for mobile robots, autonomous vehicles, and drones, which require much computation for navigation.

    Rapyuta is an open-source cloud robotics framework based on the RoboEarth Engine, developed by robotics researchers at ETHZ. Within the framework, each robot connected to Rapyuta can have a secured computing environment, giving it the ability to move its heavy computation into the cloud. In addition, the computing environments are tightly interconnected with each other and have a high-bandwidth connection to the RoboEarth knowledge repository.

    FogROS2 is an open-source extension to the Robot Operating System 2 (ROS 2) developed by researchers at UC Berkeley. It enables robots to offload computationally intensive tasks (such as SLAM, grasp planning, and motion planning) to cloud resources, thereby enhancing performance and reducing onboard computational requirements. FogROS2 automates the provisioning of cloud instances, the deployment of ROS 2 nodes, and secure communication between robots and cloud services. The platform is designed to be compatible with existing ROS 2 applications without requiring code modifications. Further advancements include FogROS2-SGC, which facilitates secure global connectivity across different networks and locations, and FogROS2-FT, which introduces fault tolerance by replicating services across multiple cloud providers to ensure robustness against failures.

    KnowRob is an extension project of RoboEarth. It is a knowledge processing system that combines knowledge representation and reasoning methods with techniques for acquiring knowledge and for grounding that knowledge in a physical system, and it can serve as a common semantic framework for integrating information from different sources.

    RoboBrain is a large-scale computational system that learns from publicly available Internet resources, computer simulations, and real-life robot trials. It accumulates everything robotics into a comprehensive and interconnected knowledge base. Applications include prototyping for robotics research, household robots, and self-driving cars.
    The goal is as direct as the project's name: to create a centralised, always-online brain for robots to tap into. The project is led by Stanford University and Cornell University, and is supported by the National Science Foundation, the Office of Naval Research, the Army Research Office, Google, Microsoft, Qualcomm, the Alfred P. Sloan Foundation and the National Robotics Initiative, whose goal is to advance robotics to help make the United States more competitive in the world economy.

    MyRobots is a service for connecting robots and intelligent devices to the Internet. It can be regarded as a social network for robots and smart objects (i.e. a Facebook for robots). By socialising, collaborating and sharing, robots can benefit from those interactions too, sharing their sensor information to give insight into their perspective on their current state.

    COALAS is funded by the INTERREG IVA France (Channel) – England European cross-border co-operation programme. The project aims to develop new technologies for disabled people through social and technological innovation and through the users' social and psychological integrity. The objective is to produce a cognitive ambient

  • Knowledge organization system

    Knowledge organization system (KOS), concept system, or concept scheme is the generic term used in knowledge organization (KO) for the selection of concepts with an indication of selected semantic relations. Despite their differences in type, coverage, and application, all KOS aim to support the organization of knowledge and information to facilitate their management and retrieval. KOS vary in complexity from simple sorted lists to complex relational networks. They represent both structural and functional features, and serve to eliminate ambiguity, control synonyms, establish relationships, and present properties. From their origins in library and information science (LIS), KOS have been applied to other domains and disciplines within science and industry, although scholarly research and debate remain primarily within the KO field. Challenges of KOS include ambiguity of terminology, repercussions of biased systems, and potential obsolescence. KOS can be expressed in RDF and RDFS as per the Simple Knowledge Organization System (SKOS) recommendation by W3C, which aims to enable the sharing and linking of KOS via the Web. One of the largest collections of KOS is the BARTOC registry.

    == Types ==

    While different schemas of KOS have been proposed, most are generally arranged in terms of the complexity of their construction and maintenance. Some scholars argue that organizing KOS on a spectrum oversimplifies the shared characteristics among them, and may even result in a non-ideal structure being chosen. The following types are not exhaustive, and are often not mutually exclusive in practice.

    === Term lists ===

    Term lists are the least structured form of KOS. They include lists, glossaries, dictionaries, and synonym rings. Authority files and gazetteers may also be considered term lists; however, other scholars categorize them, along with directories, as "metadata-like models". Examples include the Union List of Artist Names name authority file and the GeoNames gazetteer.
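
As an illustration of the least structured end of this spectrum, the following is a minimal sketch of a synonym ring used for query expansion. The rings, their terms, and the function name `expand_query` are hypothetical and chosen only for illustration; they do not come from any published vocabulary.

```python
# Hypothetical synonym rings: every term inside a ring is treated as equivalent.
SYNONYM_RINGS = [
    {"car", "automobile", "motor vehicle"},
    {"physician", "doctor", "medical practitioner"},
]


def expand_query(term: str) -> set:
    """Return the normalized term plus every synonym from rings that contain it."""
    normalized = term.strip().lower()
    expanded = {normalized}
    for ring in SYNONYM_RINGS:
        if normalized in ring:
            expanded |= ring
    return expanded


print(sorted(expand_query("Doctor")))
```

A synonym ring records only equivalence; it carries no hierarchy or other relations, which is what distinguishes it from the more structured types described next.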

    === Categorization and classification ===

    KOS that emphasize specific (and often hierarchical) structures include subject headings, taxonomies, categorization schemas, and classification schemas and systems. Despite inconsistent use of the terms "categorization" and "classification" in some literature, categorization generally refers to loosely assembled grouping schemas that may include attributes that are not mutually exclusive (or that have fuzzy boundaries), while classification relates to the arrangement of non-overlapping, mutually exclusive classes. Classification schemas may be universal (such as Dewey Decimal Classification and Information Coding Classification) or domain-specific (such as the National Library of Medicine Classification).

    === Relationship models ===

    The types of KOS with the greatest complexity, which utilize connections between concepts, include thesauri, semantic networks, and ontologies. One of the most prominent examples of a semantic network is WordNet.

    === Others ===

    Certain structures that have been proposed as types of KOS, but are not consistently included in schemas, include folksonomies, topic maps, web directory structures, publication organization systems, and bibliometric maps. Some KOS organize other KOS themselves; for instance, PeriodO is a gazetteer of periodization categories.

    == Applications ==

    Some early KOS were developed as a support system for abstracting and indexing services to be used by specially trained searchers. With the growth of information digitization, KOS became increasingly accessible and usable, and more complex structures were developed. Prominent examples of KOS outside of LIS include organism taxonomy in biology, the periodic table of elements in chemistry, the SIC and NAICS classification systems for industry and business, and the AGROVOC agricultural controlled vocabulary.

    == Challenges ==

    The study and design of KOS is an ongoing topic of discussion among KO scholars.

    === Terminology ===

    "[There is] a serious lack of vocabulary control in the literature on controlled vocabulary."

    Inconsistency of terminology within the study of KOS is a common issue. For instance, "ontology" is used both for a specific type of KOS and as a generic term for any KOS. The terms "taxonomy", "classification", and "categorization" are also sometimes used interchangeably.

    === Bias ===

    As knowledge can be historically and culturally biased, scholars have also discussed how KOS themselves can perpetuate harmful practices or stereotypes. For example, a number of concerns and criticisms about the classification of mental disorders in the Diagnostic and Statistical Manual of Mental Disorders have been raised, contributing to ongoing revisions. Ethical and intentional design approaches have been proposed for multi-perspective KOS in efforts to mitigate bias and other harmful practices.

    === Obsolescence ===

    The possible obsolescence of the thesaurus and other simpler KOS has been the topic of debate, especially in the face of increasingly complex ontologies, the growing usage of "Google-like retrieval systems", and the move of KO theory and research away from LIS and toward computer science. Supporters of thesauri argue for their continued usefulness for metadata enrichment, vocabulary mapping, and web services, as well as their usage in specific domains such as corporate intranets and digital image libraries.

  • Best arm identification

    Best arm identification (BAI) is a sequential one-player game where the player has to find the best action (arm) among a list of actions (arms) by collecting information in the most efficient way. It is a multi-armed bandit game, as a player only gets information about an arm by playing it. The most common objective in multi-armed bandit games is to minimize the regret (i.e., play the best action as much as possible), but in BAI, the goal is to find the best arm as efficiently as possible. This problem naturally arises in scenarios such as adaptive clinical trials, where the number of patients is limited and the quantification of the confidence in a treatment is important. It also arises in hyperparameter optimization, where the goal is to find the optimal choice of hyperparameters for an algorithm with the smallest possible number of experiments, as each experiment can be costly in terms of time, energy, or money.

    == Stochastic multi-armed bandit ==

    The stochastic multi-armed bandit (MAB) is a sequential game with one player and $K$ actions (arms). Each arm has an unknown probability distribution associated with it. At each turn, the player has to choose one action and receives an observation from the probability distribution associated with that arm. The more an arm is played, the more information is obtained about its probability distribution.

    === Best arm identification ===

    In BAI the goal is to find the arm whose probability distribution has the highest mean. BAI may be either fixed confidence or fixed horizon. In a fixed-confidence game, a confidence level $\delta$ is fixed at the beginning of the game and the goal is to find the best arm with this confidence level in as few turns as possible. In a fixed-horizon game, the number of turns $T$ is fixed, and the goal is to find the best arm with the highest possible confidence in $T$ turns.
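
To make the fixed-horizon game concrete, here is a minimal sketch that simulates Bernoulli arms and spends the whole budget of $T$ turns in round-robin order before recommending the arm with the highest empirical mean. Uniform allocation is only the simplest possible baseline, not a method described in the text, and the function names, the Bernoulli assumption, and the example means are all illustrative assumptions.

```python
import random


def pull(mean: float, rng: random.Random) -> int:
    """Draw one Bernoulli observation from an arm with the given (hidden) mean."""
    return 1 if rng.random() < mean else 0


def uniform_fixed_horizon(means, horizon, seed=0):
    """Play arms round-robin for `horizon` turns, then return the index of the
    arm with the highest empirical mean.

    `means` is used only by the simulator; the player never reads it directly.
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k
    totals = [0] * k
    for t in range(horizon):
        arm = t % k  # uniform allocation: every arm gets the same share of turns
        totals[arm] += pull(means[arm], rng)
        counts[arm] += 1
    # Arms that were never played (horizon < k) get an empirical mean of 0.
    empirical = [totals[a] / counts[a] if counts[a] else 0.0 for a in range(k)]
    return max(range(k), key=lambda a: empirical[a])


print(uniform_fixed_horizon([0.2, 0.5, 0.8], horizon=3000))
```

Adaptive strategies aim to do better than this baseline by spending fewer turns on arms that are clearly suboptimal.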

    === Math formalisation ===

    We have one player and $K$ actions (arms). Behind each arm $k \in \{1, \ldots, K\}$ lies an unknown distribution $\nu_{k}$ with mean $\mu_{k}$. Each distribution $\nu_{k}$ belongs to a known family $\mathcal{D}$ (such as the set of Gaussian distributions or Bernoulli distributions). At each time step $t$, the player selects an arm $a_{t}$ and observes an independent sample $X_{t} \sim \nu_{a_{t}}$ from the corresponding distribution. We denote by $\mu^{\star} := \max_{a} \mu_{a}$ the highest mean. An arm $a$ that satisfies $\mu_{a} = \mu^{\star}$ is called an optimal arm; otherwise it is called a suboptimal arm. In best arm identification (BAI) the objective is to identify an optimal arm. Two main settings for BAI appear in the literature:

      • Fixed confidence: In this setting, one typically assumes that there exists a unique optimal arm $a^{\star}$. A confidence level $\delta \in (0, 1)$ is specified at the beginning. The algorithm must stop at some finite stopping time $\tau_{\delta} < +\infty$ and return an arm $\hat{a}_{\tau_{\delta}}$ such that the probability of error is bounded: $\mathbb{P}(\hat{a}_{\tau_{\delta}} \neq a^{\star}) \leq \delta$. The objective is to minimize the expected sample complexity $\mathbb{E}[\tau_{\delta}]$. Such a setting appears, for example, when a constraint on the confidence is required (for example, if we require a confidence level of 95%, then $\delta = 1 - 0.95 = 0.05$).
      • Fixed horizon: In this setting, the number of samples $T$ is fixed in advance.
        The goal is to design an algorithm that minimizes the probability of misidentifying the optimal arm: $\mathbb{P}(\hat{a}_{T} \neq a^{\star})$. This setting appears when the number of experiments is limited (for drug tests, the number of patients can be fixed in advance).

    === Example of simple modelling ===

    Suppose we have $K$ treatments and want to determine, with a confidence level of 95%, which treatment is the best at healing a specific disease. Each treatment $k$ heals the disease with probability $\mu_{k}$ and fails to heal it otherwise, which means that each distribution is a Bernoulli distribution, so $\mathcal{D}$ is the set of Bernoulli distributions. We can use a BAI algorithm to minimize $\mathbb{E}[\tau_{0.05}]$, the expected number of patients required to find the best treatment with probability 95%.

    == Applications ==

    Best arm identification naturally arises in several practical domains:

      • Adaptive clinical trials: The objective is to identify the most effective treatment based on sequentially collected patient data. Each treatment can be modeled as having an underlying distribution of outcomes. The goal is to identify the treatment with the highest expected outcome with high confidence (fixed-confidence setting with level $\delta$) while minimizing the number of patients in the trial (minimizing $\mathbb{E}[\tau_{\delta}]$), both because enrolling patients is costly and because less effective drugs should be administered as little as possible.
      • Hyperparameter tuning: Selecting the best configuration for machine learning models efficiently by treating each hyperparameter setting as an arm.
        The goal is to find the best hyperparameter setting with as few experiments as possible, as experiments are costly in time and energy.

    == Fixed confidence level ==

    In the fixed-confidence setting, the goal is to design an algorithm that identifies the best arm with a prescribed confidence level $\delta$ while minimizing the expected number of samples. Any such algorithm requires two key components:

      • Stopping rule: A decision criterion that determines when to stop sampling. Formally, this defines a stopping time $\tau_{\delta}$ and returns an arm $\hat{a}_{\tau_{\delta}}$ such that $\mathbb{P}(\hat{a}_{\tau_{\delta}} \neq a^{\star}) \leq \delta$ and $\mathbb{P}(\tau_{\delta} < +\infty) = 1$.
      • Sampling rule: A policy $\pi$ that, at each round $t$, selects the next arm to sample $a_{t}$ based on all previous observations $(a_{s}, X_{s})_{s<t}$.
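
The interplay of a sampling rule and a stopping rule can be sketched with successive elimination, a standard textbook fixed-confidence strategy. It is not named in the text above and is used here only as an illustration. The sketch assumes rewards bounded in [0, 1] (Bernoulli arms in the simulator) and a unique best arm; the function name, the `max_rounds` safety cap, and the specific Hoeffding-style confidence radius are assumptions of this example.

```python
import math
import random


def successive_elimination(means, delta=0.05, seed=0, max_rounds=100_000):
    """Return (recommended_arm, total_samples) for simulated Bernoulli arms.

    `means` is used only by the simulator to generate rewards.
    """
    rng = random.Random(seed)
    k = len(means)
    active = list(range(k))
    totals = [0.0] * k
    samples = 0
    for t in range(1, max_rounds + 1):
        # Sampling rule: play every surviving arm once per round, so each
        # active arm has exactly t observations after round t.
        for arm in active:
            totals[arm] += 1.0 if rng.random() < means[arm] else 0.0
            samples += 1
        # Hoeffding confidence radius, with a union bound over arms and rounds.
        radius = math.sqrt(math.log(4 * k * t * t / delta) / (2 * t))
        best = max(totals[a] / t for a in active)
        # Drop every arm whose empirical mean is clearly below the leader's.
        active = [a for a in active if best - totals[a] / t <= 2 * radius]
        # Stopping rule: stop as soon as a single arm survives.
        if len(active) == 1:
            return active[0], samples
    # Safety cap reached (an assumption of this sketch, e.g. for tied arms):
    # fall back to the empirically best surviving arm.
    return max(active, key=lambda a: totals[a]), samples


arm, used = successive_elimination([0.1, 0.3, 0.9], delta=0.05)
print(arm, used)
```

The number of samples used by the run plays the role of $\tau_{\delta}$: the larger the gap between the best arm and the others, the earlier the weaker arms are eliminated.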

  • Irish logarithm

    The Irish logarithm was a system of number manipulation invented by Percy Ludgate for machine multiplication. The system used a combination of mechanical cams as lookup tables and mechanical addition to sum pseudo-logarithmic indices to produce partial products, which were then added to produce results. The technique is similar to Zech logarithms (also known as Jacobi logarithms), but uses a system of indices original to Ludgate.

    == Concept ==

    Ludgate's algorithm compresses the multiplication of two single-digit decimal numbers into two table lookups (to convert the digits into indices) and the addition of the two indices to create a new index, which is input to a second lookup table that generates the output product. Because both lookup tables are one-dimensional, and the addition of linear movements is simple to implement mechanically, this allows a less complex mechanism than would be needed to implement a two-dimensional 10×10 multiplication lookup table. Ludgate stated that he deliberately chose the values in his tables to be as small as he could make them; given this, Ludgate's tables can be simply constructed from first principles, either via pen-and-paper methods or by a systematic search using only a few tens of lines of program code. They do not correspond to Zech logarithms, Remak indexes or Korn indexes.

    == Pseudocode ==

    Ludgate's Irish logarithm algorithm can be implemented in the Python programming language using two tables. Table 1 is taken from Ludgate's original paper; the contents of Table 2 can be trivially derived from Table 1 and the definition of the algorithm. Note that since the last third of the second table is entirely zeros, this could be exploited to further simplify a mechanical implementation of the algorithm.
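
A minimal sketch of the two-table scheme described above follows. The index values in `TABLE1` are the ones commonly reproduced from Ludgate's paper and should be checked against the original; `TABLE2` is not copied from any source but derived from `TABLE1` as the text describes. The names `TABLE1`, `TABLE2`, and `product` are choices made for this example.

```python
# Table 1: pseudo-logarithmic index for each decimal digit 0..9.
# Assumption: these are the values commonly attributed to Ludgate's paper.
TABLE1 = [50, 0, 1, 7, 2, 23, 8, 33, 3, 14]

# Table 2 is derived from Table 1: the entry at index TABLE1[a] + TABLE1[b]
# holds the product a * b. Indices that no digit pair reaches stay 0.
TABLE2 = [0] * (2 * max(TABLE1) + 1)
for a in range(10):
    for b in range(10):
        TABLE2[TABLE1[a] + TABLE1[b]] = a * b


def product(a: int, b: int) -> int:
    """Multiply two decimal digits using two lookups and one addition."""
    return TABLE2[TABLE1[a] + TABLE1[b]]


# Self-check: the tables must reproduce the whole 10 x 10 multiplication table.
assert all(product(a, b) == a * b for a in range(10) for b in range(10))
print(product(7, 8))
```

The self-check matters because the derivation would silently overwrite an entry if two different products ever shared an index sum; with these index values, no such collision occurs.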

  • Class activation mapping

    Class activation mapping methods are explainable AI (XAI) techniques used to visualize the regions of an input image that are the most relevant for a particular task, especially image classification, in convolutional neural networks (CNNs). These methods generate heatmaps by weighting the feature maps from a convolutional layer according to their relevance to the target class.

    Machine learning and deep learning emerged within the field of artificial intelligence, generically defined as "the effort to automate intellectual tasks normally performed by humans". Both use statistical and computational methods to learn patterns from data, reducing the need for manually coded rules. Machine learning models are trained on input data and the corresponding known answers, learning the underlying patterns or structures present in the data. Traditional machine learning algorithms employ manually designed feature sets, creating a direct link between machine learning designers and the features employed.

    Deep learning is a subfield of machine learning, based on the concept of successive layers of representation, in which the data is progressively transformed in different ways to extract relevant and informative patterns. Deep learning algorithms are feature learning algorithms that automatically learn hierarchical feature representations from raw data, extracting increasingly abstract features through multiple layers.

    CNNs are a specific architecture of deep learning models, designed to process spatially structured data such as images. They apply a series of convolution, non-linear activation and pooling operations to extract relevant features from the input data, which are contained in the so-called feature maps. CNNs have proven highly effective in a variety of computer vision and image processing tasks. CNNs (and deep learning models more broadly) are described as black boxes due to their complex and non-transparent internal layers of representation.
    The need for clearer indications of their internal workings and decision-making processes gave birth to XAI techniques. Among the XAI techniques proposed for computer vision tasks, class activation mapping methods can show which pixels in an input image are important to the predicted logit for a class of interest in a classification task. Class activation mapping methods were originally developed for class-discriminative scenarios, to visualize which parts of the input image influenced the classification decision, namely to visually highlight the regions of the feature maps that contribute most strongly to the prediction of a given class. More advanced versions of these methods are not limited to image classification, but have also been extended to several other vision-related tasks, such as object detection, image captioning, visual question answering and image segmentation.

    == Background ==

    The following methods laid the groundwork for the class activation map approaches, forming the conceptual basis of using gradients to highlight class-discriminative regions.

    === Class model visualization and saliency maps for convolutional neural networks ===

    The class model visualization and image-specific saliency map approaches were presented in the foundational work "Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps" by Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman, which generalizes the deconvnet method by Zeiler and Fergus. Class model visualization synthesizes an artificial input image that strongly activates the output neurons associated with a target class. Given a trained, fixed model, this method starts with a zero-initialized image, backpropagates the gradients from the class score to the image pixels, updates the image pixels so as to increase the score of the specific class, and repeats the pixel-updating process, yielding an encoded prototype (an idealized version) of the class of interest.

    The image-specific class saliency visualization method provides a visual explanation by highlighting the most relevant pixels in an image for predicting a certain class $C$ of interest. This is done by computing the gradient of the class score with respect to the input image, evaluated at $I_{0}$,

    $$w = \left.\frac{\partial S_{C}}{\partial I}\right|_{I_{0}}$$

    approximating the model locally (around $I_{0}$) as linear, using a first-order Taylor expansion: $S_{C}(I) \approx w_{C}^{T} I + b$. The magnitude of $w_{C}$, the gradient, indicates the importance of the pixels: larger gradients suggest greater influence on the prediction. Once the gradient is known, the saliency map is defined as the maximum absolute gradient across the color channels:

    $$M_{ij} = \max_{C} \left| \frac{\partial S_{C}}{\partial I_{ij}^{C}} \right|$$

    resulting in a saliency map (i.e. a heatmap).

    === Guided backpropagation ===

    The concept of guided backpropagation first appears in the paper "Striving For Simplicity: The All Convolutional Net" by Springenberg et al.; this method also builds upon the work of Zeiler and Fergus, "Visualizing and Understanding Convolutional Networks". The core aim of guided backpropagation is to understand what a CNN is learning by visualizing the patterns that most strongly activate individual neurons (or filters), in architectures which do not rely on max-pooling layers. When propagating gradients back through a rectified linear unit (ReLU), guided backpropagation passes the gradient if and only if the input to the ReLU was positive (forward pass) and the output gradient is positive (backward signal), thereby masking out both inactive neurons and negative gradients and suppressing noise. The result displays sharper, high-resolution visualizations of what each neuron is responding to.
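
The two rules just described can be sketched in a few lines of plain Python. This is a minimal illustration that assumes the gradients have already been computed by some framework and are supplied as nested lists; the function names and data layout (`grads[c][i][j]` for channel `c` and pixel `(i, j)`) are assumptions of this example, not part of any library.

```python
def guided_relu_backward(forward_inputs, upstream_grads):
    """Guided-backpropagation rule for one ReLU layer (flattened to a list).

    A gradient is passed back only where the ReLU input was positive in the
    forward pass AND the incoming gradient is positive; otherwise it is zeroed.
    """
    return [g if (x > 0 and g > 0) else 0.0
            for x, g in zip(forward_inputs, upstream_grads)]


def saliency_map(grads):
    """Reduce per-channel gradients to one saliency value per pixel.

    grads[c][i][j] is the gradient of the class score with respect to
    channel c of pixel (i, j); the result is the maximum absolute gradient
    over the channels at each pixel.
    """
    channels = len(grads)
    rows, cols = len(grads[0]), len(grads[0][0])
    return [[max(abs(grads[c][i][j]) for c in range(channels))
             for j in range(cols)]
            for i in range(rows)]


print(guided_relu_backward([1.0, -2.0, 3.0, 0.5], [0.4, 0.7, -0.1, 0.2]))
print(saliency_map([[[0.1, -0.5]], [[-0.3, 0.2]]]))
```

In the first call, the second entry is zeroed because the forward input was negative, and the third because the incoming gradient was negative.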
    Guided backpropagation is a simple and practical method for model interpretability, helping to understand how and where neural networks detect semantic concepts across layers. Moreover, because of its working principle, it can be applied to any network architecture.

    == Base versions ==

    Class activation mapping and gradient-weighted class activation mapping are the original and most widely used methods for visual explanations in convolutional neural networks. These methods serve as the foundation for many later developments in explainable AI.

    Notation: In this article, the symbols i and j represent integer indices that disappear inside sums or averages, while x and y are the continuous (or up-sampled integer) coordinates of the final heat-map that is plotted.

    === Class activation mapping (CAM) ===

    Class activation mapping (CAM) was the original version of the CAM methods, and it gave its name to the whole category. The approach was first introduced by Zhou et al. in their seminal work "Learning Deep Features for Discriminative Localization". This approach obtains class-specific heatmaps by modifying image classification CNN architectures, replacing fully-connected layers with convolutional layers and a final global average pooling layer. Its main purpose is to localize and highlight the discriminative regions of an input image that a CNN uses to identify a particular class, without needing explicit bounding box annotations.

    ==== Global average pooling (GAP) ====

    Global average pooling (GAP) is the key element of the original CAM approach. It is a dimensionality reduction technique and, similarly to other pooling layers, it downsamples the feature maps by calculating representative values for a specific region of the feature map. The particularity of GAP is that it calculates a single value for an entire feature map, significantly reducing the model dimensions.
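
As a minimal sketch of these ideas, the following plain-Python example averages each feature map to a single value (GAP) and builds a class heatmap as the weighted sum of the feature maps, which is the combination used in the original CAM formulation. The feature-map values, the class weights, and the function names are hypothetical, and the linear classifier is assumed to have no bias term.

```python
def global_average_pool(feature_map):
    """Average all entries of one 2-D feature map into a single value."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)


def class_activation_map(feature_maps, class_weights):
    """Weighted sum of the feature maps, one weight per feature map."""
    rows, cols = len(feature_maps[0]), len(feature_maps[0][0])
    return [[sum(w * fmap[i][j] for w, fmap in zip(class_weights, feature_maps))
             for j in range(cols)]
            for i in range(rows)]


# Two hypothetical 2 x 2 feature maps and the weights of one class.
feature_maps = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
class_weights = [0.5, -1.0]

pooled = [global_average_pool(fmap) for fmap in feature_maps]
logit = sum(w * f for w, f in zip(class_weights, pooled))
heatmap = class_activation_map(feature_maps, class_weights)

print(pooled)
print(heatmap)
print(logit, global_average_pool(heatmap))
```

Because averaging and weighting are both linear, the class logit equals the spatial average of the heatmap, which is why the heatmap can be read as a per-location contribution to the class score.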

    ==== Mathematical description ====

    The mathematical description rests on the combination of convolutional and GAP layers. In CAM, the GAP layer must be placed after the last convolutional layer and before the final linear classifier layer. This last element of the architecture connects the output logits (the network predictions) $y^{C}$ to the GAP values through the corresponding learned weights $w_{k}^{C}$. Considering $A^{k}$ as the feature maps of the last convolutional layer, GAP produces one value for each feature map by averaging all the matrix elements $(i, j)$ of the feature map:

    $$F^{k} = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} A_{ij}^{k}$$

    with

    $$A^{k} = \begin{bmatrix} A_{11}^{k} & A_{12}^{k} & \cdots & A_{1n}^{k} \\ A_{21}^{k} & A_{22}^{k} & \cdots & A_{2n}^{k} \\ \vdots & \vdots & \ddots & \vdots \\ A_{m1}^{k} & A_{m2}^{k} & \cdots & A_{mn}^{k} \end{bmatrix} = \left\{ A_{ij}^{k} \mid 1 \leq i \leq m,\ 1 \leq j \leq n \right\}$$

    Read more →
  • Master data

    Master data

    Master data represents "data about the business entities that provide context for business transactions". The most commonly found categories of master data are parties (individuals and organisations, and their roles, such as customers, suppliers, employees), products, financial structures (such as ledgers and cost centres) and locational concepts. Master data should be distinguished from reference data. While both provide context for business transactions, reference data is concerned with classification and categorisation, while master data is concerned with business entities. Master data is, by its nature, almost always non-transactional. There exist edge cases where an organization may need to treat certain transactional processes and operations as "master data". This arises, for example, where information about master data entities, such as customers or products, is only contained within transactional data such as orders and receipts and is not housed separately. ISO 8000 is the international standard for data quality and data portability in master data. == Alternative definition == An alternative definition of the term master data is that it represents the business objects that contain the most valuable, agreed-upon information shared across an organization. In this sense, it gives context to business activities and transactions, answering questions like who, what, when and how, as well as expanding the ability to make sense of these activities through categorizations, groupings and hierarchies. It can cover relatively static reference data as well as transactional, unstructured, analytical and hierarchical data and metadata. What constitutes master data under this definition is therefore not about an essential quality of the data (e.g. it is a business entity that provides context for business transactions), but rather about the context in which the organisation has decided to treat the data. 
== Externally-defined master data == For most organisations, most or all master data is defined and managed within that organisation. Some master data, however, may be externally defined and managed. This represents the single source of basic business data used across a marketplace, regardless of organisation or location. Thus, it can be used by multiple enterprises within a value chain, facilitating "integration of multiple data sources and literally [putting] everyone in the market on the same page." An example of market master data is the Universal Product Code (UPC) found on consumer products. == Master data management == Curating and managing master data is key to ensuring its quality and thus fitness for purpose. All aspects of an organisation, operational and analytical, are greatly dependent on the quality of an organization's master data. Master Data is therefore the focus of the information technology (IT) discipline of master data management (MDM). Without this discipline in place, organisations commonly encounter difficulties with having multiple versions of "the truth" about a business entity, both within individual applications, and distributed across applications.

    Read more →
  • Unrestricted algorithm

    Unrestricted algorithm

    An unrestricted algorithm is an algorithm for the computation of a mathematical function that puts no restrictions on the range of the argument or on the precision that may be demanded in the result. The idea of such an algorithm was put forward by C. W. Clenshaw and F. W. J. Olver in a paper published in 1980. In conventional ("restricted") algorithms for computing the values of a real-valued function of a real variable, e.g. g(x), the error that can be tolerated in the result is specified in advance. An interval on the real line is also specified for the values of x at which the function is to be evaluated. Different algorithms may have to be applied to evaluate the function outside that interval. An unrestricted algorithm envisages a situation in which a user may stipulate the value of x and also the precision required in g(x) quite arbitrarily. The algorithm should then produce an acceptable result without failure.
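To make the idea concrete, here is a small Python sketch in the spirit of an unrestricted algorithm: the caller chooses both the argument and the number of correct decimal digits. The choice of the square root as the function g, the name `sqrt_to_digits`, and the use of exact integer arithmetic are illustrative assumptions; this is not the method of Clenshaw and Olver.

```python
from fractions import Fraction
from math import isqrt


def sqrt_to_digits(x, digits):
    """Return sqrt(x) rounded down to `digits` decimal places, as an exact Fraction.

    x may be any non-negative int or Fraction; digits may be any non-negative int.
    Exact integer arithmetic means neither argument is limited by float precision.
    """
    x = Fraction(x)
    if x < 0 or digits < 0:
        raise ValueError("x and digits must be non-negative")
    scale = 10 ** digits
    # floor(sqrt(x) * scale) equals isqrt(floor(x * scale**2))
    scaled = (x.numerator * scale * scale) // x.denominator
    return Fraction(isqrt(scaled), scale)


print(sqrt_to_digits(2, 5))
print(sqrt_to_digits(Fraction(1, 4), 3))
```

Because Python integers are arbitrary precision, the same call works for very large arguments or for hundreds of digits, at the cost of more computation time.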

    Read more →
  • Time Warp Edit Distance

    Time Warp Edit Distance

    In the data analysis of time series, Time Warp Edit Distance (TWED) is a measure of similarity (or dissimilarity) between pairs of discrete time series, controlling the relative distortion of the time units of the two series using the physical notion of elasticity. In contrast to other distance measures (e.g. DTW (dynamic time warping) or LCS (longest common subsequence problem)), TWED is a metric. Its computational time complexity is $O(n^{2})$, but this can be drastically reduced in some specific situations by using a corridor to reduce the search space. Its memory space complexity can be reduced to $O(n)$. It was first proposed in 2009 by P.-F. Marteau. == Definition == $\delta _{\lambda ,\nu }(A_{1}^{p},B_{1}^{q})=\min {\begin{cases}\delta _{\lambda ,\nu }(A_{1}^{p-1},B_{1}^{q})+\Gamma (a'_{p}\to \Lambda )&{\text{delete in A}}\\\delta _{\lambda ,\nu }(A_{1}^{p-1},B_{1}^{q-1})+\Gamma (a'_{p}\to b'_{q})&{\text{match or substitution}}\\\delta _{\lambda ,\nu }(A_{1}^{p},B_{1}^{q-1})+\Gamma (\Lambda \to b'_{q})&{\text{delete in B}}\end{cases}}$ where $\Gamma (a'_{p}\to \Lambda )=d_{LP}(a'_{p},a'_{p-1})+\nu \cdot (t_{a_{p}}-t_{a_{p-1}})+\lambda $, $\Gamma (a'_{p}\to b'_{q})=d_{LP}(a'_{p},b'_{q})+d_{LP}(a'_{p-1},b'_{q-1})+\nu \cdot (|t_{a_{p}}-t_{b_{q}}|+|t_{a_{p-1}}-t_{b_{q-1}}|)$ and $\Gamma (\Lambda \to b'_{q})=d_{LP}(b'_{q},b'_{q-1})+\nu \cdot (t_{b_{q}}-t_{b_{q-1}})+\lambda $.
The recursion $\delta _{\lambda ,\nu }$ is initialized as: $\delta _{\lambda ,\nu }(A_{1}^{0},B_{1}^{0})=0$, $\delta _{\lambda ,\nu }(A_{1}^{0},B_{1}^{j})=\infty $ for $j\geq 1$, and $\delta _{\lambda ,\nu }(A_{1}^{i},B_{1}^{0})=\infty $ for $i\geq 1$, with $a'_{0}=b'_{0}=0$. === Implementations === An implementation of the TWED algorithm in C with a Python wrapper is available. TWED is also implemented in the Time Series Subsequence Search Python package (TSSEARCH for short), available at [1]. An R implementation of TWED has been integrated into TraMineR, an R package for mining, describing and visualizing sequences of states or events, and more generally discrete sequence data. Additionally, cuTWED is a CUDA-accelerated implementation of TWED which uses an improved algorithm due to G. Wright (2020). This method is linear in memory and massively parallelized. cuTWED is written in CUDA C/C++, comes with Python bindings, and also includes Python bindings for Marteau's reference C implementation.
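The recursion and its initialization can be sketched directly as a dynamic program in Python. This is an illustrative sketch, not one of the implementations listed above: it assumes scalar samples, takes the distance d_LP to be the absolute difference, and assigns timestamp 0 to the padding samples a'_0 and b'_0. The function name `twed` and its parameter names are hypothetical.

```python
import math


def twed(a, ta, b, tb, nu, lam):
    """Time Warp Edit Distance between two scalar time series.

    a, b   : sample values;  ta, tb : their timestamps
    nu     : stiffness (elasticity) parameter, nu >= 0
    lam    : constant penalty added to each delete operation, lam >= 0
    """
    # Pad with a'_0 = b'_0 = 0 (timestamp 0 for the padding is an assumption).
    a, ta = [0.0] + list(a), [0.0] + list(ta)
    b, tb = [0.0] + list(b), [0.0] + list(tb)
    n, m = len(a), len(b)

    # dp[i][j] holds delta(A_1^i, B_1^j); borders are infinite except dp[0][0].
    dp = [[math.inf] * m for _ in range(n)]
    dp[0][0] = 0.0

    for i in range(1, n):
        for j in range(1, m):
            delete_a = (dp[i - 1][j] + abs(a[i] - a[i - 1])
                        + nu * (ta[i] - ta[i - 1]) + lam)
            delete_b = (dp[i][j - 1] + abs(b[j] - b[j - 1])
                        + nu * (tb[j] - tb[j - 1]) + lam)
            match = (dp[i - 1][j - 1]
                     + abs(a[i] - b[j]) + abs(a[i - 1] - b[j - 1])
                     + nu * (abs(ta[i] - tb[j]) + abs(ta[i - 1] - tb[j - 1])))
            dp[i][j] = min(delete_a, match, delete_b)
    return dp[n - 1][m - 1]


print(twed([1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3], nu=0.5, lam=1.0))
```

The full table is kept here for readability, so memory use is quadratic; only the previous row is needed to compute the distance itself, which is how the linear memory bound mentioned earlier can be reached.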

    Read more →
  • Reconstruction from projections

    Reconstruction from projections

    The problem of reconstructing a multidimensional signal from its projection is uniquely multidimensional, having no 1-D counterpart. It has applications that range from computer-aided tomography to geophysical signal processing. It is a problem which can be explored from several points of view: as a deconvolution problem, a modeling problem, an estimation problem, or an interpolation problem. == Motivation and applications == Many fields in science and engineering use reconstruction from projections, especially in imaging. It is widely applied in geophysical tomography, medical imaging and industrial radiography. For example, in a CT scanner, the 3D structure of the patient's body being scanned is measured with beams going through the tissue and hitting a detector, giving a flat projection of the body from that angle. Multiple projections are put together to get an image of the position and shape of structures inside in 3D. == Problem statement and basics == A projection is a linear mapping of an $M$-dimensional signal into an $N$-dimensional one, where $N\leq M$. The objective of reconstruction is to restore the $M$-dimensional signal from the $N$-dimensional signal. The following case is a 2-D signal projected into a 1-D signal. The signal in the original coordinates is denoted $d(u,v)$. Now consider a collimated beam of radiation coming from the opposite orientation of ${\hat {v}}$, producing a projection along ${\hat {u}}$. ${\hat {v}}$ and ${\hat {u}}$ are normal to each other, and the angle between $u$ and ${\hat {u}}$ is $\theta $. The signal obtained along the ${\hat {u}}$ axis is defined to be $p_{\theta }({\hat {u}})$. 
The relationship between the original coordinates and the rotated coordinates is given by ${\begin{bmatrix}{\hat {u}}\\{\hat {v}}\end{bmatrix}}={\begin{bmatrix}\cos \theta &\sin \theta \\-\sin \theta &\cos \theta \end{bmatrix}}{\begin{bmatrix}u\\v\end{bmatrix}}$ or inversely, ${\begin{bmatrix}u\\v\end{bmatrix}}={\begin{bmatrix}\cos \theta &-\sin \theta \\\sin \theta &\cos \theta \end{bmatrix}}{\begin{bmatrix}{\hat {u}}\\{\hat {v}}\end{bmatrix}}$. Then we have $p_{\theta }({\hat {u}})=\int _{-\infty }^{\infty }d(u,v)\,\mathrm {d} {\hat {v}}=\int _{-\infty }^{\infty }d({\hat {u}}\cos \theta -{\hat {v}}\sin \theta ,\ {\hat {u}}\sin \theta +{\hat {v}}\cos \theta )\,\mathrm {d} {\hat {v}}$. By varying $\theta $, a large number of projections can be obtained. By the projection-slice theorem, $D(\Omega ,\theta )$, the slice of the Fourier transform of $d(u,v)$ at angle $\theta $, is equal to $P_{\theta }(\Omega )$, the Fourier transform of the projection $p_{\theta }({\hat {u}})$. 
Therefore, the unknown $d(u,v)$ can be obtained from its Fourier transform by means of the Fourier transform inversion integral $d(u,v)={\frac {1}{4\pi ^{2}}}\int _{-\infty }^{\infty }\int _{-\infty }^{\infty }D(\Omega _{1},\Omega _{2})e^{j\Omega _{1}u}e^{j\Omega _{2}v}\,\mathrm {d} \Omega _{1}\,\mathrm {d} \Omega _{2}$ $={\frac {1}{4\pi ^{2}}}\int _{-\pi }^{\pi }\int _{0}^{\infty }D(\Omega ,\theta )e^{j\Omega u\cos \theta }e^{j\Omega v\sin \theta }|\Omega |\,\mathrm {d} \Omega \,\mathrm {d} \theta $ $={\frac {1}{4\pi ^{2}}}\int _{-\pi }^{\pi }\int _{0}^{\infty }P_{\theta }(\Omega )e^{j\Omega (u\cos \theta +v\sin \theta )}|\Omega |\,\mathrm {d} \Omega \,\mathrm {d} \theta $ $={\frac {1}{4\pi ^{2}}}\int _{0}^{\pi }\left(\int _{-\infty }^{\infty }P_{\theta }(\Omega )|\Omega |e^{j\Omega {\hat {u}}}\,\mathrm {d} \Omega \right)\mathrm {d} \theta $, where ${\hat {u}}=u\cos \theta +v\sin \theta $. By taking the inverse Fourier transform and assuming $g({\hat {u}})={\mathcal {F}}^{-1}(|\Omega |^{2})$, we get $d(u,v)=\sum _{i}\Delta \theta _{i}\,[p_{\theta }({\hat {u}})*g_{\theta _{i}}({\hat {u}})]$. == Approaches == In practice, a wide variety of methods are used, most of which reconstruct 3-D information (a volume) from 2-D signals (images). Typically used methods are CT, MRI, PET and SPECT, and filtered back projection, based on the principles introduced above, is commonly applied. 
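The projection $p_{\theta }({\hat {u}})$ defined in the problem statement can be approximated on a pixel grid. The Python sketch below is a rough discretization under stated assumptions: the image is a list of rows indexed by v with columns indexed by u, coordinates are centred on the image, and each pixel is added to the nearest integer bin of ${\hat {u}}=u\cos \theta +v\sin \theta $ instead of being integrated along a line. The function name `project` is hypothetical.

```python
import math


def project(image, theta):
    """Discrete projection of a 2-D image onto the rotated u-hat axis.

    image: list of rows (index v), each a list of pixel values (index u).
    theta: projection angle in radians.
    Returns a list of bin sums; the middle bin corresponds to u_hat = 0.
    """
    rows, cols = len(image), len(image[0])
    cu, cv = (cols - 1) / 2, (rows - 1) / 2   # image centre
    half = math.ceil(math.hypot(cu, cv))      # largest possible |u_hat|
    bins = [0.0] * (2 * half + 1)
    for r in range(rows):
        for c in range(cols):
            u, v = c - cu, r - cv
            u_hat = u * math.cos(theta) + v * math.sin(theta)
            bins[int(round(u_hat)) + half] += image[r][c]
    return bins


img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
print(project(img, 0.0))           # sums along v (column sums)
print(project(img, math.pi / 2))   # sums along u (row sums)
```

Calling `project` for many angles yields the set of projections from which filtered back projection would start; the filtering and back projection steps themselves are not shown here.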
=== Computed Tomography (CT) === In CT, a volume is formed by stacking the axial slices. The software cuts the volume in a different plane (usually orthogonal). Commonly, slice data is generated using an X-ray source that rotates around the object. X-ray sensors are positioned on the opposite side of the circle from the X-ray source. === Magnetic resonance imaging (MRI) === In MRI, energy from an oscillating magnetic field is temporarily applied to the patient at the appropriate resonance frequency. The protons (hydrogen atoms) emit a radio frequency signal which is measured by a receiving coil. The radio signal can be made to encode position information by varying the main magnetic field using gradient coils. === Positron emission tomography (PET) === The system detects pairs of gamma rays emitted indirectly by a positron-emitting radionuclide (tracer), which is introduced into the body on a biologically active molecule. Three-dimensional images of tracer concentration within the body are then constructed by computer analysis. In modern PET-CT scanners, three dimensional imaging is often accomplished with the aid of a CT X-ray scan performed on the patient during the same session, in the same machine. === Single-photon emission computed tomography (SPECT) === SPECT imaging is performed by using a gamma camera to acquire multiple 2-D images (projections) from multiple angles. Multiple projections are used to yield a 3-D data set. This data set may then be manipulated to show thin slices along any chosen axis of the body. SPECT is similar to PET in its use of radioactive tracer material and detection of gamma rays, while the tracers used in SPECT emit gamma radiation that is measured more directly.

    Read more →
  • Conceptualization (information science)

    Conceptualization (information science)

    In information science, a conceptualization is an abstract simplified view of some selected parts of the world, containing the objects, concepts, and other entities that are presumed of interest for some particular purpose and the relationships between them. An explicit specification of a conceptualization is an ontology, and it may occur that a conceptualization can be realized by several distinct ontologies. An ontological commitment in describing ontological comparisons is taken to refer to that subset of elements of an ontology shared with all the others. "An ontology is language-dependent", its objects and interrelations described within the language it uses, while a conceptualization is always the same, more general, its concepts existing "independently of the language used to describe it". The relation between these terms is shown in the figure to the right. Not all workers in knowledge engineering use the term "conceptualization", but instead refer to the conceptualization itself, or to the ontological commitment of all its realizations, as an overarching ontology. == Purpose and implementation == As a higher level abstraction, a conceptualization facilitates the discussion and comparison of its various ontologies, facilitating knowledge sharing and reuse. Each ontology based upon the same overarching conceptualization maps the conceptualization into specific elements and their relationships. The question then arises as to how to describe the "conceptualization" in terms that can encompass multiple ontologies. This issue has been called the Tower of Babel problem, that is, how can persons used to one ontology talk with others using a different ontology? This problem is easily grasped, but a general resolution is not at hand. It can be a "bottom-up" or a "top-down" approach, or something in between. 
However, in more artificial situations, such as information systems, the idea of a "conceptualization" and the "ontological commitment" of various ontologies that realize the "conceptualization" is possible. The formation of a conceptualization and its ontologies involves these steps: specification of the conceptualization; ontology concepts: every definition involves the definitions of other terms; relationships between the concepts: this step maps conceptual relationships onto the ontology structure; groups of concepts: this step may lead to the creation of sub-ontologies; formal description of ontology commitments, for example, to make them computer readable. An example of moving conception into a language leading to a variety of ontologies is the expression of a process in pseudocode (a strictly structured form of ordinary language) leading to implementation in several different formal computer languages like Lisp or Fortran. The pseudocode makes it easier to understand the instructions and compare implementations, but the formal languages make possible the compilation of the ideas as computer instructions. Another example is mathematics, where a very general formulation (the analog of a conceptualization) is illustrated with "applications" that are more specialized examples. For instance, aspects of a function space can be illustrated using a vector space or a topological space that introduce interpretations of the "elements" of the conceptualization and additional relationships between them but preserve the connections required in the function space.

    Read more →
  • DONE

    DONE

    The Data-based Online Nonlinear Extremumseeker (DONE) algorithm is a black-box optimization algorithm. DONE models the unknown cost function and attempts to find an optimum of the underlying function. The DONE algorithm is suitable for optimizing costly and noisy functions and does not require derivatives. An advantage of DONE over similar algorithms, such as Bayesian optimization, is that the computational cost per iteration is independent of the number of function evaluations. == Methods == The DONE algorithm was first proposed by Hans Verstraete and Sander Wahls in 2015. The algorithm fits a surrogate model based on random Fourier features and then uses the well-known L-BFGS algorithm to find an optimum of the surrogate model. == Applications == DONE was first demonstrated for maximizing the signal in optical coherence tomography measurements, but has since been used in various other applications. For example, it was used to help extend the field of view in light sheet fluorescence microscopy.

    Read more →
  • Parchive

    Parchive

    Parchive (a portmanteau of parity archive, and formally known as Parity Volume Set Specification) is an erasure code system that produces par files for checksum verification of data integrity, with the capability to perform data recovery operations that can repair or regenerate corrupted or missing data. Parchive was originally written to solve the problem of reliable file sharing on Usenet, but it can be used for protecting any kind of data from data corruption, disc rot, bit rot, and accidental or malicious damage. Despite the name, Parchive uses more advanced techniques (specifically error correction codes) than simplistic parity methods of error detection. As of 2015, PAR1 is obsolete, PAR2 is mature for widespread use, and PAR3 is a discontinued experimental version developed by MultiPar author Yutaka Sawada. The original SourceForge Parchive project has been inactive since April 30, 2015. A new PAR3 specification has been worked on since April 28, 2019 by PAR2 specification author Michael Nahas. An alpha version of the PAR3 specification was published on January 29, 2022, while the program itself is being developed. == History == Parchive was intended to increase the reliability of transferring files via Usenet newsgroups. Usenet was originally designed for informal conversations, and the underlying protocol, NNTP, was not designed to transmit arbitrary binary data. Another limitation, which was acceptable for conversations but not for files, was that messages were normally fairly short and limited to 7-bit ASCII text. Various techniques were devised to send files over Usenet, such as uuencoding and Base64. Later Usenet software allowed 8-bit Extended ASCII, which permitted new techniques like yEnc. Large files were broken up to reduce the effect of a corrupted download, but the unreliable nature of Usenet remained. With the introduction of Parchive, parity files could be created that were then uploaded along with the original data files. 
If any of the data files were damaged or lost while being propagated between Usenet servers, users could download parity files and use them to reconstruct the damaged or missing files. Parchive included the construction of small index files (.par in version 1 and .par2 in version 2) that do not contain any recovery data. These indexes contain file hashes that can be used to quickly identify the target files and verify their integrity. Because the index files were so small, they minimized the amount of extra data that had to be downloaded from Usenet to verify that the data files were all present and undamaged, or to determine how many parity volumes were required to repair any damage or reconstruct any missing files. They were most useful in version 1 where the parity volumes were much larger than the short index files. These larger parity volumes contain the actual recovery data along with a duplicate copy of the information in the index files (which allows them to be used on their own to verify the integrity of the data files if there is no small index file available). In July 2001, Tobias Rieper and Stefan Wehlus proposed the Parity Volume Set specification, and with the assistance of other project members, version 1.0 of the specification was published in October 2001. Par1 used Reed–Solomon error correction to create new recovery files. Any of the recovery files can be used to rebuild a missing file from an incomplete download. Version 1 became widely used on Usenet, but it did suffer some limitations: It was restricted to handle at most 255 files. The recovery files had to be the size of the largest input file, so it did not work well when the input files were of various sizes. (This limited its usefulness when not paired with the proprietary RAR compression tool.) The recovery algorithm had a bug, due to a flaw in the academic paper on which it was based. It was strongly tied to Usenet and it was felt that a more general tool might have a wider audience. 
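The recovery idea described above, rebuilding one missing file from the surviving files plus one recovery file, can be illustrated with plain XOR parity in Python. This is a deliberately simplified stand-in: Parchive itself uses Reed–Solomon coding, which generalizes the idea to several missing files. The function names `xor_parity` and `recover_missing` are hypothetical. Note that the parity block is as long as the largest input, which mirrors the Par1 limitation on recovery file size.

```python
def xor_parity(blocks):
    """XOR all byte strings together, treating shorter ones as zero-padded."""
    size = max(len(block) for block in blocks)
    parity = bytearray(size)
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)


def recover_missing(surviving, parity, length):
    """Rebuild the single missing file from the survivors and the parity block.

    length is the original size of the missing file (assumed to be known,
    for example from an index file).
    """
    return xor_parity(list(surviving) + [parity])[:length]


files = [b"alpha", b"be", b"gamma!"]
parity = xor_parity(files)

# Pretend files[1] was lost in transit and rebuild it.
restored = recover_missing([files[0], files[2]], parity, length=2)
print(restored)
```

A single XOR parity block can repair only one missing file; recovering two or more requires independent recovery blocks, which is what the Reed–Solomon construction provides.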
In January 2002, Howard Fukada proposed that a new Par2 specification should be devised with the significant changes that data verification and repair should work on blocks of data rather than whole files, and that the algorithm should switch to using 16-bit numbers rather than the 8-bit numbers that PAR1 used. Michael Nahas and Peter Clements took up these ideas in July 2002, with additional input from Paul Nettle and Ryan Gallagher (who both wrote Par1 clients). Version 2.0 of the Parchive specification was published by Michael Nahas in September 2002. Peter Clements then went on to write the first two Par2 implementations, QuickPar and par2cmdline. As par2cmdline had been abandoned since 2004, Paul Houle created phpar2 to supersede it. Yutaka Sawada created MultiPar to supersede QuickPar. MultiPar uses par2j.exe (which is partially based on par2cmdline's optimization techniques) as its backend engine. == Versions == Versions 1 and 2 of the file format are incompatible. (However, many clients support both.) === Par1 === For Par1, given the files f1, f2, ..., fn, the Parchive consists of an index file (f.par), which is a CRC-type file with no recovery blocks, and a number of "parity volumes" (f.p01, f.p02, etc.). If one of the original files is missing (for example, f2), it can be recreated from all of the other original files and any one of the parity volumes. Alternatively, it is possible to recreate two missing files from any two of the parity volumes and so forth. Par1 supports up to a total of 256 source and recovery files. === Par2 === Par2 files generally use this naming/extension system: filename.vol000+01.PAR2, filename.vol001+02.PAR2, filename.vol003+04.PAR2, filename.vol007+06.PAR2, etc. The number after the "+" in the filename indicates how many blocks it contains, and the number after "vol" indicates the number of the first recovery block within the PAR2 file. 
If an index file of a download states that 4 blocks are missing, the easiest way to repair the files would be by downloading filename.vol003+04.PAR2. However, due to the redundancy, filename.vol007+06.PAR2 is also acceptable. There is also an index file, filename.PAR2, which is identical in function to the small index file used in PAR1. The Par2 specification supports up to 32,768 source blocks and up to 65,535 recovery blocks. Input files are split into multiple equal-sized blocks so that recovery files do not need to be the size of the largest input file. Although Unicode is mentioned in the PAR2 specification as an option, most PAR2 implementations do not support Unicode. Directory support is included in the PAR2 specification, but most or all implementations do not support it. === Par3 === The Par3 specification was originally planned to be published as an enhancement over the Par2 specification. However, to date, it has been kept closed source by specification owner Yutaka Sawada. A discussion on a new format started in the GitHub issue section of the maintained fork of par2cmdline on January 29, 2019. The discussion led to a new format, which is also named Par3. The new Par3 format's specification is published on GitHub, but remained an alpha draft as of January 28, 2022. The specification is written by Michael Nahas, the author of the Par2 specification, with help from Yutaka Sawada, animetosho and malaire. The new format claims to have multiple advantages over the Par2 format, including support for: more than 2^16 files and more than 2^16 blocks; packing small files into one block, as well as deduplication when a block appears in multiple files; UTF-8 file names; file permissions, hard links, symbolic/soft links, and empty directories; embedding PAR data inside other formats, like ZIP archives or ISO disk images; "incremental backups", where a user creates recovery files for some file or folder, changes some data, and creates new recovery files reusing some of the older files; more error correction code algorithms (such as LDPC and sparse random matrix); BLAKE3 hashes, dropping support for the MD5 hashes used in PAR2.
== Software == === Multi-platform ===
par2+tbb (GPLv2): a concurrent (multithreaded) version of par2cmdline 0.4 using TBB. Only compatible with x86-based CPUs. It is available in the FreeBSD Ports system as par2cmdline-tbb.
Original par2cmdline (obsolete): available in the FreeBSD Ports system as par2cmdline.
par2cmdline: maintained fork by BlackIkeEagle.
par2cmdline-mt (GPLv2 or later): another multithreaded version of par2cmdline, using OpenMP. Currently merged into BlackIkeEagle's fork and maintained there.
ParPar (CC0): a high-performance, multithreaded PAR2 client and Node.js library. It does not support verifying or repairing; it can currently only create PAR2 archives.
par2deep (LGPL-3.0): produces, verifies and repairs par2 files recursively, both on the command line and with the aid of a graphical user interface. It is available in the Python Package Index system as par2deep.
par2cron (MIT License) is an o

    Read more →
  • Inductive programming

    Inductive programming

    Inductive programming (IP) is a special area of automatic programming, covering research from artificial intelligence and programming, which addresses learning of typically declarative (logic or functional) and often recursive programs from incomplete specifications, such as input/output examples or constraints. Depending on the programming language used, there are several kinds of inductive programming. Inductive functional programming, which uses functional programming languages such as Lisp or Haskell, and most especially inductive logic programming, which uses logic programming languages such as Prolog and other logical representations such as description logics, have been more prominent, but other (programming) language paradigms have also been used, such as constraint programming or probabilistic programming. == Definition == Inductive programming incorporates all approaches which are concerned with learning programs or algorithms from incomplete (formal) specifications. Possible inputs in an IP system are a set of training inputs and corresponding outputs or an output evaluation function, describing the desired behavior of the intended program, traces or action sequences which describe the process of calculating specific outputs, constraints for the program to be induced concerning its time efficiency or its complexity, various kinds of background knowledge such as standard data types, predefined functions to be used, program schemes or templates describing the data flow of the intended program, heuristics for guiding the search for a solution or other biases. Output of an IP system is a program in some arbitrary programming language containing conditionals and loop or recursive control structures, or any other kind of Turing-complete representation language. 
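The input and output of an IP system described above can be illustrated with a toy enumerative search in Python: given a handful of input/output examples and a small set of predefined functions as background knowledge, it returns the shortest composition of those functions that is consistent with every example. The primitive set, the names `PRIMITIVES`, `run` and `induce`, and the brute-force search strategy are all assumptions made for this sketch; practical systems use far more refined search and support recursion.

```python
from itertools import product

# Background knowledge: predefined functions the learner may compose (hypothetical set).
PRIMITIVES = {
    "inc": lambda x: x + 1,
    "double": lambda x: x * 2,
    "square": lambda x: x * x,
    "neg": lambda x: -x,
}


def run(program, x):
    """Apply the named primitives to x in order, left to right."""
    for name in program:
        x = PRIMITIVES[name](x)
    return x


def induce(examples, max_len=3):
    """Return the first shortest program consistent with all (input, output) pairs."""
    for length in range(1, max_len + 1):
        for program in product(PRIMITIVES, repeat=length):
            if all(run(program, x) == y for x, y in examples):
                return program
    return None  # no program of length <= max_len fits the examples


examples = [(1, 4), (2, 6), (3, 8)]   # incomplete specification of f(x) = 2 * (x + 1)
program = induce(examples)
print(program, run(program, 10))
```

Three examples are enough here because the hypothesis space is tiny; preferring shorter programs acts as the bias that guides the search.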
In many applications the output program must be correct with respect to the examples and partial specification, and this leads to the consideration of inductive programming as a special area inside automatic programming or program synthesis, usually opposed to 'deductive' program synthesis, where the specification is usually complete. In other cases, inductive programming is seen as a more general area where any declarative programming or representation language can be used and we may even have some degree of error in the examples, as in general machine learning, the more specific area of structure mining or the area of symbolic artificial intelligence. A distinctive feature is the number of examples or the amount of partial specification needed. Typically, inductive programming techniques can learn from just a few examples. The diversity of inductive programming usually comes from the applications and the languages that are used: apart from logic programming and functional programming, other programming paradigms and representation languages have been used or suggested in inductive programming, such as functional logic programming, constraint programming, probabilistic programming, abductive logic programming, modal logic, action languages, agent languages and many types of imperative languages. == History == The early works of Plotkin, and his "relative least general generalization (rlgg)", had an enormous impact on inductive logic programming. There were some encouraging results on learning recursive Prolog programs such as quicksort from examples together with suitable background knowledge, for example with GOLEM. However, after initial success, the community became disappointed by the limited progress on the induction of recursive programs, with ILP focusing less and less on recursive programs and leaning more and more towards a machine learning setting with applications in relational data mining and knowledge discovery. 
In parallel to work in ILP, Koza proposed genetic programming in the early 1990s as a generate-and-test based approach to learning programs. The idea of genetic programming was further developed into the inductive programming system ADATE and the systematic-search-based system MagicHaskeller. Here again, functional programs are learned from sets of positive examples together with an output evaluation (fitness) function which specifies the desired input/output behavior of the program to be learned. The early work in grammar induction (also known as grammatical inference) is related to inductive programming, as rewriting systems or logic programs can be used to represent production rules. In fact, early works in inductive inference considered grammar induction and Lisp program inference as basically the same problem. The results in terms of learnability were related to classical concepts, such as identification-in-the-limit, as introduced in the seminal work of Gold. More recently, the language learning problem was addressed by the inductive programming community. In recent years, the classical approaches have been resumed and advanced with great success. Therefore, the synthesis problem has been reformulated against the background of constructor-based term rewriting systems, taking into account modern techniques of functional programming, moderate use of search-based strategies, the use of background knowledge, and automatic invention of subprograms. Many new and successful applications have recently appeared beyond program synthesis, most especially in the area of data manipulation, programming by example and cognitive modelling (see below). Other ideas have also been explored with the common characteristic of using declarative languages for the representation of hypotheses. 
For instance, the use of higher-order features, schemes or structured distances has been advocated for a better handling of recursive data types and structures; abstraction has also been explored as a more powerful approach to cumulative learning and function invention. One powerful paradigm that has been recently used for the representation of hypotheses in inductive programming (generally in the form of generative models) is probabilistic programming (and related paradigms, such as stochastic logic programs and Bayesian logic programming). == Application areas == The first workshop on Approaches and Applications of Inductive Programming (AAIP), held in conjunction with ICML 2005, identified all applications where "learning of programs or recursive rules are called for, [...] first in the domain of software engineering where structural learning, software assistants and software agents can help to relieve programmers from routine tasks, give programming support for end users, or support of novice programmers and programming tutor systems. Further areas of application are language learning, learning recursive control rules for AI-planning, learning recursive concepts in web-mining or for data-format transformations". Since then, these and many other areas have proven to be successful application niches for inductive programming, such as end-user programming, the related areas of programming by example and programming by demonstration, and intelligent tutoring systems. Other areas where inductive inference has been recently applied are knowledge acquisition, artificial general intelligence, reinforcement learning and theory evaluation, and cognitive science in general. There may also be prospective applications in intelligent agents, games, robotics, personalisation, ambient intelligence and human interfaces.

    Read more →
  • Pseudonymization

    Pseudonymization

    Pseudonymization is a data management and de-identification procedure by which personally identifiable information fields within a data record are replaced by one or more artificial identifiers, or pseudonyms. A single pseudonym for each replaced field or collection of replaced fields makes the data record less identifiable while remaining suitable for data analysis and data processing. Pseudonymization (or pseudonymisation, the spelling under European guidelines) is one way to comply with the European Union's General Data Protection Regulation (GDPR) demands for secure data storage of personal information. Pseudonymized data can be restored to its original state with the addition of information which allows individuals to be re-identified. In contrast, anonymization is intended to prevent re-identification of individuals within the dataset. Clause 18, Module Four, footnote 2 of the Adoption by the European Commission of the Implementing Decisions (EU) 2021/914 "requires rendering the data anonymous in such a way that the individual is no longer identifiable by anyone ... and that this process is irreversible." == Impact of Schrems II ruling == The European Data Protection Supervisor (EDPS) on 9 December 2021 highlighted pseudonymization as the top technical supplementary measure for Schrems II compliance. Less than two weeks later, the EU Commission highlighted pseudonymization as an essential element of the equivalency decision for South Korea, which is the status that was lost by the United States under the Schrems II ruling by the Court of Justice of the European Union (CJEU). 
The importance of GDPR-compliant pseudonymization increased dramatically in June 2021 when the European Data Protection Board (EDPB) and the European Commission highlighted GDPR-compliant pseudonymization as the state-of-the-art technical supplementary measure for the ongoing lawful use of EU personal data when using third country (i.e., non-EU) cloud processors or remote service providers under the "Schrems II" ruling by the CJEU. Under the GDPR and final EDPB Schrems II Guidance, the term pseudonymization requires a new protected "state" of data, producing a protected outcome that: Protects direct, indirect, and quasi-identifiers, together with characteristics and behaviors; Protects at the record and data set level versus only the field level so that the protection travels wherever the data goes, including when it is in use; and Protects against unauthorized re-identification via the mosaic effect by generating high entropy (uncertainty) levels by dynamically assigning different tokens at different times for various purposes. The combination of these protections is necessary to prevent the re-identification of data subjects without the use of additional information kept separately, as required under GDPR Article 4(5) and as further underscored by paragraph 85(4) of the final EDPB Schrems II guidance: Article 4(5) "Definitions" of the GDPR defines pseudonymization as "the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organisational measures to ensure that the personal data are not attributed to an identified or identifiable natural person." 
"Use Case 2: Transfer of pseudonymised Data Paragraph 85(4)" of the final EDPB Schrems II Guidance requires that "the controller has established by means of a thorough analysis of the data in question – taking into account any information that the public authorities of the recipient country may be expected to possess and use – that the pseudonymised personal data cannot be attributed to an identified or identifiable natural person even if cross-referenced with such information." GDPR-compliant pseudonymization requires that data is "anonymous" in the strictest EU sense of the word – globally anonymous – but for the additional information held separately and made available under controlled conditions as authorized by the data controller for permitted re-identification of individual data subjects. Clause 18, Module Four, footnote 2 of the Adoption by the European Commission of the Implementing Decision (EU) 2021/914 "requires rendering the data anonymous in such a way that the individual is no longer identifiable by anyone, in line with recital 26 of Regulation (EU) 2016/679, and that this process is irreversible." Before the Schrems II ruling, pseudonymization was a technique used by security experts or government officials to hide personally identifiable information while maintaining data structure and privacy of information. Some common examples of sensitive information include postal code, location of individuals, names of individuals, race and gender, etc. After the Schrems II ruling, GDPR-compliant pseudonymization must satisfy the above-noted elements as an "outcome" versus merely a technique. == Data fields == The choice of which data fields are to be pseudonymized is partly subjective. Less selective fields, such as birth date or postal code, are often also included because they are usually available from other sources and therefore make a record easier to identify. 
Pseudonymizing these less identifying fields removes most of their analytic value and is therefore normally accompanied by the introduction of new derived and less identifying forms, such as year of birth or a larger postal code region. Data fields that are less identifying, such as date of attendance, are usually not pseudonymized. This is because too much statistical utility is lost in doing so, not because the data cannot be identified. For example, given prior knowledge of a few attendance dates it is easy to identify someone's data in a pseudonymized dataset by selecting only those people with that pattern of dates. This is an example of an inference attack. The vulnerability of pre-GDPR pseudonymized data to inference attacks is commonly overlooked. A famous example is the AOL search data scandal. The AOL example of unauthorized re-identification did not require access to separately kept "additional information" that was under the control of the data controller, as is now required for GDPR-compliant pseudonymization, outlined below under the section "New definition under GDPR". Protecting statistically useful pseudonymized data from re-identification requires a sound information security base and control of the risk that the analysts, researchers or other data workers cause a privacy breach. The pseudonym allows tracking back of data to its origins, which distinguishes pseudonymization from anonymization, where all person-related data that could allow backtracking has been purged. Pseudonymization is an issue in, for example, patient-related data that has to be passed on securely between clinical centers. The application of pseudonymization to e-health intends to preserve the patient's privacy and data confidentiality. It allows primary use of medical records by authorized health care providers and privacy preserving secondary use by researchers. 
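The field-level technique and the attendance-date inference attack described above can be sketched together. This is a minimal illustration on made-up records, assuming a single identifying field and a static random token per value; the names `pseudonymize`, `match_by_dates` and the record layout are hypothetical, and the static tokens do not by themselves meet the GDPR-compliant "outcome" discussed earlier.

```python
import secrets


def pseudonymize(records, field):
    """Replace `field` with a random pseudonym, one per distinct value.

    Returns the pseudonymized records and the lookup table, which plays
    the role of the separately kept "additional information".
    """
    lookup = {}
    output = []
    for record in records:
        value = record[field]
        if value not in lookup:
            lookup[value] = secrets.token_hex(8)
        output.append({**record, field: lookup[value]})
    return output, lookup


def match_by_dates(records, known_dates):
    """Return pseudonyms whose attendance dates include every known date."""
    dates_by_pseudonym = {}
    for record in records:
        dates_by_pseudonym.setdefault(record["name"], set()).add(record["date"])
    wanted = set(known_dates)
    return [p for p, dates in dates_by_pseudonym.items() if wanted <= dates]


# Made-up attendance records; all names and dates are illustrative.
visits = [
    {"name": "Alice", "date": "2021-03-01"},
    {"name": "Alice", "date": "2021-03-09"},
    {"name": "Alice", "date": "2021-04-02"},
    {"name": "Bob", "date": "2021-03-01"},
    {"name": "Bob", "date": "2021-03-15"},
    {"name": "Carol", "date": "2021-03-09"},
    {"name": "Carol", "date": "2021-04-02"},
]

released, lookup = pseudonymize(visits, "name")

# An attacker who knows two of Alice's attendance dates needs no lookup table.
candidates = match_by_dates(released, ["2021-03-01", "2021-04-02"])
print(len(candidates))                   # 1
print(candidates[0] == lookup["Alice"])  # True
```

The released records contain no names, yet the unpseudonymized date field is enough to single out one person, which is why the lookup table alone does not bound the re-identification risk.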
In the US, HIPAA provides guidelines on how health care data must be handled, and data de-identification or pseudonymization is one way to simplify HIPAA compliance. However, plain pseudonymization for privacy preservation often reaches its limits when genetic data are involved (see also genetic privacy). Due to the identifying nature of genetic data, depersonalization is often not sufficient to hide the corresponding person. Potential solutions are the combination of pseudonymization with fragmentation and encryption. One application of the pseudonymization procedure is the creation of datasets for de-identification research by replacing identifying words with words from the same category (e.g. replacing a name with a random name from a dictionary of names); in this case, however, it is generally not possible to track data back to its origins. == New definition under GDPR == Effective as of May 25, 2018, the EU General Data Protection Regulation (GDPR) defines pseudonymization for the very first time at the EU level in Article 4(5). Under Article 4(5) definitional requirements, data is pseudonymized if it cannot be attributed to a specific data subject without the use of separately kept "additional information". Pseudonymized data embodies the state of the art in Data Protection by Design and by Default because it requires protection of both direct and indirect identifiers (not just direct). GDPR Data Protection by Design and by Default principles as embodied in pseudonymization require protection of both direct and indirect identifiers so that personal data is not cross-referenceable (or re-identifiable) via the "mosaic effect" without access to "additional information" that is kept separately by the controller. Because access to separately kept "additional information" is required

    Read more →
  • Recording format

    Recording format

    A recording format is a format for encoding data for storage on a storage medium. The format can be container information such as sectors on a disk, or user/audience information (content) such as analog stereo audio. Multiple levels of encoding may be achieved in one format. For example, a text-encoded page may contain HTML and XML encoding, combined in a plain text file format, using either EBCDIC or ASCII character encoding, on a UDF digitally formatted disk. In electronic media, the primary format is the encoding that requires hardware to interpret (decode) data, while secondary encoding is interpreted by secondary signal processing methods, usually computer software. == Recording container formats == A container format is a system for dividing physical storage space or virtual space for data. Data space can be divided evenly by a system of measurement, or divided unevenly with meta data. A grid may divide physical or virtual space with physical or virtual borders (dividers), evenly or unevenly. Just as a physical container (such as a file cabinet) is divided by physical borders (such as drawers and file folders), data space is divided by virtual borders. Meta data such as a unit of measurement, address, or meta tags act as virtual borders in a container format. A template may be considered an abstract format for containing a solution as well as the content itself. Examples: systems of measurement; metric system; geographic coordinate system; page grid; film formats; audio data format; video tape format; disk format; file format; meta data; text formatting; template; data structure. == Raw content formats == A raw content format is a system of converting data to displayable information. Raw content formats may either be recorded in secondary signal processing methods such as a software container format (e.g. digital audio, digital video) or recorded in the primary format. A primary raw content format may be directly observable (e.g. 
image, sound, motion, smell, sensation) or physical data which only requires hardware to display it, such as a phonographic needle and diaphragm or a projector lamp and magnifying glass.
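The layered-encoding example given earlier (HTML in a plain text file, using either EBCDIC or ASCII) can be made concrete with a small sketch. It assumes Python's standard `cp037` codec as a representative EBCDIC code page; the helper name `encode_layers` and the sample markup are illustrative only.

```python
def encode_layers(markup, charset):
    """Turn markup text (one layer of encoding) into bytes (the next layer)."""
    return markup.encode(charset)


page = "<p>Hello</p>"  # HTML markup carried as plain text

ascii_bytes = encode_layers(page, "ascii")
ebcdic_bytes = encode_layers(page, "cp037")  # cp037 is one EBCDIC code page

# Same markup, different bytes on the medium.
print(ascii_bytes.hex())
print(ebcdic_bytes.hex())

# Decoding with the matching character encoding recovers identical markup.
print(ascii_bytes.decode("ascii") == ebcdic_bytes.decode("cp037"))  # True
```

The markup layer is unchanged by the choice of character encoding; only the byte-level representation differs, which is the sense in which several levels of encoding coexist in one format.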

    Read more →