AI For Business Owners Course

AI For Business Owners Course — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Self-supervised learning

    Self-supervised learning

    Self-supervised learning (SSL) is a paradigm in machine learning where a model is trained on a task using the data itself to generate supervisory signals, rather than relying on externally-provided labels. In the context of neural networks, self-supervised learning aims to leverage inherent structures or relationships within the input data to create meaningful training signals. SSL tasks are designed so that solving them requires capturing essential features or relationships in the data. The input data is typically augmented or transformed in a way that creates pairs of related samples, where one sample serves as the input, and the other is used to formulate the supervisory signal. This augmentation can involve introducing noise, cropping, rotation, or other transformations. Self-supervised learning more closely imitates the way humans learn to classify objects. During SSL, the model learns in two steps. First, the task is solved based on an auxiliary or pretext classification task using pseudo-labels, which help to initialize the model parameters. Next, the actual task is performed with supervised or unsupervised learning. Self-supervised learning has produced promising results in recent years, and has found practical application in fields such as audio processing, and is being used by Facebook and others for speech recognition. == Pseudo-labels == Pseudo-labels are automatically generated labels that a model assigns to unlabeled data based on its own predictions. They are widely used in self-supervised and semi-supervised learning, where ground-truth annotations are limited or unavailable. By treating predicted labels as surrogate ground truth, learning algorithms can make use of large quantities of unlabeled data in the training process. Pseudo-labeling also plays an important role in systems that must adapt to concept drift, where the statistical properties of the data change over time. In these scenarios, the model may detect that an incoming instance deviates from previously learned behavior. The system then generates a classification result for that instance, and this predicted class is used as a pseudo-label for updating or retraining model components that are becoming outdated. This approach enables continuous adaptation in dynamic environments without requiring manual annotation. In many adaptive learning pipelines, pseudo-labels are chosen when the classifier produces sufficiently confident predictions, reducing the risk of propagating errors. These pseudo-labeled instances are then incorporated into training to refresh or evolve the model's understanding of emerging data patterns, particularly when existing components show signs of “aging” due to drift or distributional shifts. This strategy reduces reliance on manual labeling while helping maintain long-term model performance. == Types == === Autoassociative self-supervised learning === Autoassociative self-supervised learning is a specific category of self-supervised learning where a neural network is trained to reproduce or reconstruct its own input data. In other words, the model is tasked with learning a representation of the data that captures its essential features or structure, allowing it to regenerate the original input. The term "autoassociative" comes from the fact that the model is essentially associating the input data with itself. This is often achieved using autoencoders, which are a type of neural network architecture used for representation learning. Autoencoders consist of an encoder network that maps the input data to a lower-dimensional representation (latent space), and a decoder network that reconstructs the input from this representation. The training process involves presenting the model with input data and requiring it to reconstruct the same data as closely as possible. The loss function used during training typically penalizes the difference between the original input and the reconstructed output (e.g. mean squared error). By minimizing this reconstruction error, the autoencoder learns a meaningful representation of the data in its latent space. === Contrastive self-supervised learning === For a binary classification task, training data can be divided into positive examples and negative examples. Positive examples are those that match the target. For example, if training a classifier to identify birds, the positive training data would include images that contain birds. Negative examples would be images that do not. Contrastive self-supervised learning uses both positive and negative examples. The loss function in contrastive learning is used to minimize the distance between positive sample pairs, while maximizing the distance between negative sample pairs. An early example uses a pair of 1-dimensional convolutional neural networks to process a pair of images and maximize their agreement. Contrastive Language-Image Pre-training (CLIP) allows joint pretraining of a text encoder and an image encoder, such that a matching image-text pair have image encoding vector and text encoding vector that span a small angle (having a large cosine similarity). InfoNCE (Noise-Contrastive Estimation) is a method to optimize two models jointly, based on Noise Contrastive Estimation (NCE). Given a set X = { x 1 , … x N } {\displaystyle X=\left\{x_{1},\ldots x_{N}\right\}} of N {\displaystyle N} random samples containing one positive sample from p ( x t + k ∣ c t ) {\displaystyle p\left(x_{t+k}\mid c_{t}\right)} and N − 1 {\displaystyle N-1} negative samples from the 'proposal' distribution p ( x t + k ) {\displaystyle p\left(x_{t+k}\right)} , it minimizes the following loss function: L N = − E X [ log ⁡ f k ( x t + k , c t ) ∑ x j ∈ X f k ( x j , c t ) ] {\displaystyle {\mathcal {L}}_{\mathrm {N} }=-\mathbb {E} _{X}\left[\log {\frac {f_{k}\left(x_{t+k},c_{t}\right)}{\sum _{x_{j}\in X}f_{k}\left(x_{j},c_{t}\right)}}\right]} === Non-contrastive self-supervised learning === Non-contrastive self-supervised learning (NCSSL) uses only positive examples. Counterintuitively, NCSSL converges on a useful local minimum rather than reaching a trivial solution, with zero loss. For the example of binary classification, it would trivially learn to classify each example as positive. Effective NCSSL requires an extra predictor on the online side that does not back-propagate on the target side. === Joint-Embedding and Predictive Architectures === A major class of self-supervised learning moves beyond contrastive pairs, instead maximizing the agreement between views while preventing collapse through statistical constraints. Rooted in Deep Canonical Correlation Analysis (Deep CCA), this approach includes Joint-Embedding Architectures (JEA) like Barlow Twins and VICReg, which enforce covariance constraints to learn invariant representations without negative sampling. Deep Latent Variable Path Modelling (DLVPM) generalizes this to multimodal systems, using path models to enforce correlation and orthogonality across diverse data types. In 2022 Yann LeCun introduced Joint-Embedding Predictive Architectures (JEPA) as a step towards decision making, reasoning, and autonomous human intelligence in machines, including self-improvement through autonomous learning. Founded in representation learning, LeCun included the concept of a “world model” in JEPA which aims to enable machines to replicate human intellect by providing machines with a concept for the world in which they exist. Unlike autoencoders, JEPAs operate entirely in latent space, avoiding pixel-level noise to focus on semantic structure. Rather than just learning invariance, JEPAs learn by predicting masked latent representations from visible context. JEPA has been applied to domains such as image analysis, audio processing, and motion in images and video. == Comparison with other forms of machine learning == SSL belongs to supervised learning methods insofar as the goal is to generate a classified output from the input. At the same time, however, it does not require the explicit use of labeled input-output pairs. Instead, correlations, metadata embedded in the data, or domain knowledge present in the input are implicitly and autonomously extracted from the data. These supervisory signals, extracted from the data, can then be used for training. SSL is similar to unsupervised learning in that it does not require labels in the sample data. Unlike unsupervised learning, however, learning is not done using inherent data structures. Semi-supervised learning combines supervised and unsupervised learning, requiring only a small portion of the learning data be labeled. In transfer learning, a model designed for one task is reused on a different task. Training an autoencoder intrinsically constitutes a self-supervised process, because the output pattern needs to become an optimal reconstruction of the input pattern itself. However, in current jargon, the term 'self-supervised' often refers to tasks based on a pretext-task training setup

    Read more →
  • Information flow

    Information flow

    In discourse-based grammatical theory, information flow is any tracking of referential information by speakers. Information may be new, i.e., just introduced into the conversation; given, i.e., already active in the speakers' consciousness; or old, i.e., no longer active. The various types of activation, and how these are defined, are model-dependent. Information flow affects grammatical structures such as: Word order (topic, focus, and afterthought constructions). Active, passive, or middle voice. Choice of deixis, such as articles; "medial" deictics such as Spanish ese and Japanese sore are generally determined by the familiarity of a referent rather than by physical distance. Overtness of information, such as whether an argument of a verb is indicated by a lexical noun phrase, a pronoun, or not mentioned at all. Clefting: Splitting a single clause into two clauses, each with its own verb, e.g. ‘The chicken turtles tasted like chicken.’ becomes ‘It was the chicken turtle | that tasted like chicken.’ In this case, clefting is used to shift the focus of the sentence to the subject, the chicken turtle. Front focus: Placing at the start (front) of a sentence information that would normally occur later in the sentence, to give it extra prominence. For example, in pop culture, Yoda's speech often utilizes such syntactic construction, such as when he says 'much to learn you still have' to Luke Skywalker. End focus (or end weight): Given or familiar information followed by new information. This gives prominence to the final part of the sentences and can enable suspense to build, e.g. ‘Through the door came a gigantic wolf’.(Umer Prince)

    Read more →
  • Enterprise data planning

    Enterprise data planning

    Enterprise data planning is the starting point for enterprise wide change. It states the destination and describes how you will get there. It defines benefits, costs and potential risks. It provides measures to be used along the way to judge progress and adjust the journey according to changing circumstances. Data is fundamental to investment enterprises. Effective, economic management of data underpins operations and enables transformations needed to satisfy customer demands, competition and regulation. Data warehouse(s) and other aspects of the overall data architecture are critical to the enterprise. EDMworks has created a strategic data planning approach for the Investment Sector. It consists of a planning process, planning intranets, templates and training materials. EDMworks planning process is based on the belief that extensive domain knowledge significantly shortens planning iterations and enables progressively higher quality plans to be produced and implemented. This approach drives the development of an effective and economic enterprise data architecture. Enterprise data planning is based on proven business disciplines. Key architectural layers for data and applications are then added in order to provide an enterprise wide understanding of the uses and interdependencies of data. This enables the definition of the core components of the EDM plan: Industry structure and business objectives Assessment of systems and services Target architecture for applications, data and infrastructure Target organization structures Systems, database, infrastructure and organizational plans Business case, costs, benefits, results and risks. EDMworks uses several components from the Open Systems Group TOGAF enterprise systems planning process. TOGAF acts as an extension to good business planning methods to provide a framework for the development of the systems and data architectural components. == History == James Martin was one of the pathfinders in data planning methodologies. He was one of the first to identify data as being an enterprise wide asset that required management. He developed a series of tools and methods to support that process. Most of the large consulting firms developed their own methods to address the same basic issue. Frequently, their approaches were incorporated into their own branded system development methodologies that encompassed the complete systems development life-cycle. Others, such as Ed Tozer, developed more focused offerings that dealt with the complexities of extracting key business needs from senior management and then defining relevant architectural visions for the specific enterprise. From these various sources, the concepts of Business, Data, Applications and Technology Architectures emerged. The Open Group Architectural Framework (TOGAF) has taken this work forward and has established a sound method in TOGAF version 9. EDMworks approach is to adopt these planning and architectural practices as a basis and then add two additional dimensions to the planning and implementation focus: Domain knowledge of the Investments sector. Investments is a complex global industry with a common set of characteristics about clients, information vendors, competition and regulation. Domain knowledge significantly improves the quality of the planning and implementation processes Development of people and teams. Change is a major feature of in any Enterprise Data Management program and people and teams both need development in order to make EDM effective throughout an organization.

    Read more →
  • Information seeking

    Information seeking

    Information seeking is the process or activity of attempting to obtain information in both human and technological contexts. Information seeking is related to, but different from, information retrieval (IR). == Compared to information retrieval == Traditionally, IR tools have been designed for IR professionals to enable them to effectively and efficiently retrieve information from a source. It is assumed that the information exists in the source and that a well-formed query will retrieve it (and nothing else). It has been argued that laypersons' information seeking on the internet is very different from information retrieval as performed within the IR discourse. Yet, internet search engines are built on IR principles. Since the late 1990s a body of research on how casual users interact with internet search engines has been forming, but the topic is far from fully understood. IR can be said to be technology-oriented, focusing on algorithms and issues such as precision and recall. Information seeking may be understood as a more human-oriented and open-ended process than information retrieval. In information seeking, one does not know whether there exists an answer to one's query, so the process of seeking may provide the learning required to satisfy one's information need. == In different contexts == Much library and information science (LIS) research has focused on the information-seeking practices of practitioners within various fields of professional work. Studies have been carried out into the information-seeking behaviors of librarians, academics, medical professionals, engineers, lawyers and mini-publics(among others). Much of this research has drawn on the work done by Leckie, Pettigrew (now Fisher) and Sylvain, who in 1996 conducted an extensive review of the LIS literature (as well as the literature of other academic fields) on professionals' information seeking. The authors proposed an analytic model of professionals' information seeking behaviour, intended to be generalizable across the professions, thus providing a platform for future research in the area. The model was intended to "prompt new insights... and give rise to more refined and applicable theories of information seeking" (1996, p. 188). The model has been adapted by Wilkinson (2001) who proposes a model of the information seeking of lawyers. Recent studies in this topic address the concept of information-gathering that "provides a broader perspective that adheres better to professionals' work-related reality and desired skills." (Solomon & Bronstein, 2021). == Theories of information-seeking behavior == A variety of theories of information behavior – e.g. Zipf's Principle of Least Effort, Brenda Dervin's Sense Making, Elfreda Chatman's Life in the Round – seek to understand the processes that surround information seeking. In addition, many theories from other disciplines have been applied in investigating an aspect or whole process of information seeking behavior. A review of the literature on information seeking behavior shows that information seeking has generally been accepted as dynamic and non-linear (Foster, 2005; Kuhlthau 2006). People experience the information search process as an interplay of thoughts, feelings and actions (Kuhlthau, 2006). Donald O. Case (2007) also wrote a good book that is a review of the literature. Information seeking has been found to be linked to a variety of interpersonal communication behaviors beyond question-asking, to include strategies such as candidate answers. Robinson's (2010) research suggests that when seeking information at work, people rely on both other people and information repositories (e.g., documents and databases), and spend similar amounts of time consulting each (7.8% and 6.4% of work time, respectively; 14.2% in total). However, the distribution of time among the constituent information seeking stages differs depending on the source. When consulting other people, people spend less time locating the information source and information within that source, similar time understanding the information, and more time problem solving and decision making, than when consulting information repositories. Furthermore, the research found that people spend substantially more time receiving information passively (i.e., information that they have not requested) than actively (i.e., information that they have requested), and this pattern is also reflected when they provide others with information. == Wilson's nested model of conceptual areas == The concepts of information seeking, information retrieval, and information behaviour are objects of investigation of information science. Within this scientific discipline a variety of studies has been undertaken analyzing the interaction of an individual with information sources in case of a specific information need, task, and context. The research models developed in these studies vary in their level of scope. Wilson (1999) therefore developed a nested model of conceptual areas, which visualizes the interrelation of the here mentioned central concepts. Wilson defines models of information behavior to be "statements, often in the form of diagrams, that attempt to describe an information-seeking activity, the causes and consequences of that activity, or the relationships among stages in information-seeking behaviour" (1999: 250).

    Read more →
  • Fred (chatbot)

    Fred (chatbot)

    Fred, or FRED, was an early chatbot written by Robby Garner. == History == The name Fred was initially suggested by Karen Lindsey, and then Robby jokingly came up with an acronym, "Functional Response Emulation Device." Fred has also been implemented as a Java application by Paco Nathan called JFRED Archived 2008-08-24 at the Wayback Machine. Fred Chatterbot is designed to explore Natural Language communications between people and computer programs. In particular, this is a study of conversation between people and ways that a computer program can learn from other people's conversations to make its own conversations. Fred used a minimalistic "stimulus-response" approach. It worked by storing a database of statements and their responses, and made its own reply by looking up the input statements made by a user and then rendering the corresponding response from the database. This approach simplified the complexity of the rule base, but required expert coding and editing for modifications. Fred was a predecessor to Albert One, which Garner used in 1998 and 1999 to win the Loebner Prize.

    Read more →
  • Lion algorithm

    Lion algorithm

    Lion algorithm (LA) is one among the bio-inspired (or) nature-inspired optimization algorithms (or) that are mainly based on meta-heuristic principles. It was first introduced by B. R. Rajakumar in 2012 in the name, Lion’s Algorithm. It was further extended in 2014 to solve the system identification problem. This version was referred as LA, which has been applied by many researchers for their optimization problems. == Inspiration from lion’s social behaviour == Lions form a social system called a "pride", which consists of 1–3 pair of lions. A pride of lions shares a common area known as territory in which a dominant lion is called as territorial lion. The territorial lion safeguards its territory from outside attackers, especially nomadic lions. This process is called territorial defense. It protects the cubs till they become sexually matured. The maturity period is about 2–4 years. The pride undergoes survival fights to protect its territory and the cubs from nomadic lions. Upon getting defeated by the nomadic lions, the dominating nomadic lion takes the role of territorial lion by killing or driving out the cubs of the pride. The lioness of the pride give birth to cubs though the new territorial lion. When the cubs of the pride mature and considered to be stronger than the territorial lion, they take over the pride. This process is called territorial take-over. If territorial take-over happens, either the old territorial lion, which is considered to be laggard, is driven out or it leaves the pride. The stronger lions and lioness form the new pride and give birth to their own cubs == Terminology == In the LA, the terms that are associated with lion’s social system are mapped to the terminology of optimization problems. Few of such notable terms are related here. Lion: A potential solution to be generated or determined as optimal (or) near-optimal solution of the problem. The lion can be a territorial lion and lioness, cubs and nomadic lions that represent the solution based on the processing steps of the LA. Territorial lion: The strongest solution of the pride that tends to meet the objective function. Nomadic lion: A random solution, sometimes termed as nomad, to facilitate the exploration principle Laggard lion: Poor solutions that are failed in the survival fight. Pride: A pool of potential solutions i.e. a lion, lioness and their cubs, that are potential solutions of the search problem. Fertility evaluation: A process of evaluating whether the territorial lion and lioness are able to provide potential solutions in the future generations i.e. It ensures that the lion or lioness converge at every generation. Survival fight: It is a greedy selection process, which is often carried out between the pride and nomadic lion. == Algorithm == The steps involved in LA are given below: Pride Generation: Generate X m a l e {\displaystyle X^{male}} , X f e m a l e {\displaystyle X^{female}} and X 1 n o m a d {\displaystyle X_{1}^{nomad}} Determine f ( X m a l e ) {\displaystyle f(X^{male})} , f ( X f e m a l e ) {\displaystyle f(X^{female})} , f ( X 1 n o m a d ) {\displaystyle f(X_{1}^{nomad})} Initialize f r e f {\displaystyle f^{ref}} as f ( X m a l e ) {\displaystyle f(X^{male})} and N g {\displaystyle N_{g}} as 0 Memorize X m a l e {\displaystyle X^{male}} and X f e m a l e {\displaystyle X^{female}} Apply Fertility evaluation Process Generation of cubpool by mating Gender clustering: Define X c u b m a l e {\displaystyle X_{cub}^{male}} and X c u b f e m a l e {\displaystyle X_{cub}^{female}} Initialize a g e c u b {\displaystyle age_{cub}} as zero Apply Cub growth function Territorial defense: If X m a l e {\displaystyle X^{male}} (or pride) fails in the survival fight i.e. X 1 n o m a d {\displaystyle X_{1}^{nomad}} defeats the pride, go to step 4, else continue Increase a g e c u b {\displaystyle age_{cub}} by 1 and check whether cub attains maturity i.e., if a g e c u b > a g e m a x {\displaystyle age_{cub}>age_{max}} , go to Step 9, else continue Territorial takeover: If X c u b m a l e {\displaystyle X_{cub}^{male}} and X c u b f e m a l e {\displaystyle X_{cub}^{female}} are found to be closer to optimal solution, update X m a l e {\displaystyle X^{male}} and X f e m a l e {\displaystyle X^{female}} Increment N g {\displaystyle N_{g}} by 1 Repeat from Step 5, if termination criterion is not violated, else return X m a l e {\displaystyle X^{male}} as the near-optimal solution == Variants == The LA has been further taken forward to adopt in different problem areas. According to the characteristics of the problem area, significant amendment has been done in the processes and the models used in the LA. Accordingly, diverse variants have been developed by the researchers. They can be broadly grouped as hybrid LAs and non-hybrid LAs. Hybrid LAs are the LAs that are amended by the principle of other meta-heuristics, whereas the Non-hybrid LAs take any scientific amendment inside its operation that are felt to be essential to attend the respective problem area. == Applications == LA is applied in diverse engineering applications that range from network security, text mining, image processing, electrical systems, data mining and many more. Few of the notable applications are discussed here. Networking applications: In WSN, LA is used to solve the cluster head selection problem by determining optimal cluster head. Route discovery problem in both the VANET and MANET are also addressed by the LA in the literature. It is also used to detect attacks in advanced networking scenarios such as Software-Defined Networks (SDN) Power Systems: LA has attended generation rescheduling problem in a deregulated environment, optimal localization and sizing of FACTS devices for power quality enhancement and load-frequency controlling problem Cloud computing: LA is used in optimal container-resource allocation problem in cloud environment and cloud security

    Read more →
  • Magic state distillation

    Magic state distillation

    Magic state distillation is a method for creating more accurate quantum states from multiple noisy ones, which is important for building fault tolerant quantum computers. It has also been linked to quantum contextuality, a concept thought to contribute to quantum computers' power. The technique was first proposed by Emanuel Knill in 2004, and further analyzed by Sergey Bravyi and Alexei Kitaev the same year. Thanks to the Gottesman–Knill theorem, it is known that some quantum operations (operations in the Clifford group) can be perfectly simulated in polynomial time on a classical computer. In order to achieve universal quantum computation, a quantum computer must be able to perform operations outside this set. Magic state distillation achieves this, in principle, by concentrating the usefulness of imperfect resources, represented by mixed states, into states that are conducive for performing operations that are difficult to simulate classically. A variety of qubit magic state distillation routines and distillation routines for qubits with various advantages have been proposed. == Stabilizer formalism == The Clifford group consists of a set of n {\displaystyle n} -qubit operations generated by the gates {H, S, CNOT} (where H is Hadamard and S is [ 1 0 0 i ] {\displaystyle {\begin{bmatrix}1&0\\0&i\end{bmatrix}}} ) called Clifford gates. The Clifford group generates stabilizer states which can be efficiently simulated classically, as shown by the Gottesman–Knill theorem. This set of gates with a non-Clifford operation is universal for quantum computation. == Magic states == Magic states are purified from n {\displaystyle n} copies of a mixed state ρ {\displaystyle \rho } . These states are typically provided via an ancilla to the circuit. A magic state for the π / 6 {\displaystyle \pi /6} rotation operator is | M ⟩ = cos ⁡ ( β / 2 ) | 0 ⟩ + e i π 4 sin ⁡ ( β / 2 ) | 1 ⟩ {\displaystyle |M\rangle =\cos(\beta /2)|0\rangle +e^{i{\frac {\pi }{4}}}\sin(\beta /2)|1\rangle } where β = arccos ⁡ ( 1 3 ) {\displaystyle \beta =\arccos \left({\frac {1}{\sqrt {3}}}\right)} . A non-Clifford gate can be generated by combining (copies of) magic states with Clifford gates. Since a set of Clifford gates combined with a non-Clifford gate is universal for quantum computation, magic states combined with Clifford gates are also universal. == Purification algorithm for distilling |M〉 == The first magic state distillation algorithm, invented by Sergey Bravyi and Alexei Kitaev, is as follows. Input: Prepare 5 imperfect states. Output: An almost pure state having a small error probability. repeat Apply the decoding operation of the five-qubit error correcting code and measure the syndrome. If the measured syndrome is | 0000 ⟩ {\displaystyle |0000\rangle } , the distillation attempt is successful. else Get rid of the resulting state and restart the algorithm. until The states have been distilled to the desired purity.

    Read more →
  • Leiden algorithm

    Leiden algorithm

    The Leiden algorithm is a community detection algorithm developed by Traag et al at Leiden University. It was developed as a modification of the Louvain method. Like the Louvain method, the Leiden algorithm attempts to optimize modularity in extracting communities from networks; however, it addresses key issues present in the Louvain method, namely poorly connected communities and the resolution limit of modularity. == Improvement over Louvain method == Broadly, the Leiden algorithm uses the same two primary phases as the Louvain algorithm: a local node moving step (though, the method by which nodes are considered in Leiden is more efficient) and a graph aggregation step. However, to address the issues with poorly-connected communities and the merging of smaller communities into larger communities (the resolution limit of modularity), the Leiden algorithm employs an intermediate refinement phase in which communities may be split to guarantee that all communities are well-connected. Consider, for example, the following graph: Three communities are present in this graph (each color represents a community). Additionally, the center "bridge" node (represented with an extra circle) is a member of the community represented by blue nodes. Now consider the result of a node-moving step which merges the communities denoted by red and green nodes into a single community (as the two communities are highly connected): Notably, the center "bridge" node is now a member of the larger red community after node moving occurs (due to the greedy nature of the local node moving algorithm). In the Louvain method, such a merging would be followed immediately by the graph aggregation phase. However, this causes a disconnection between two different sections of the community represented by blue nodes. In the Leiden algorithm, the graph is instead refined: The Leiden algorithm's refinement step ensures that the center "bridge" node is kept in the blue community to ensure that it remains intact and connected, despite the potential improvement in modularity from adding the center "bridge" node to the red community. == Graph components == Before defining the Leiden algorithm, it will be helpful to define some of the components of a graph. === Vertices and edges === A graph is composed of vertices (nodes) and edges. Each edge is connected to two vertices, and each vertex may be connected to zero or more edges. Edges are typically represented by straight lines, while nodes are represented by circles or points. In set notation, let V {\displaystyle V} be the set of vertices, and E {\displaystyle E} be the set of edges: V := { v 1 , v 2 , … , v n } E := { e i j , e i k , … , e k l } {\displaystyle {\begin{aligned}V&:=\{v_{1},v_{2},\dots ,v_{n}\}\\E&:=\{e_{ij},e_{ik},\dots ,e_{kl}\}\end{aligned}}} where e i j {\displaystyle e_{ij}} is the directed edge from vertex v i {\displaystyle v_{i}} to vertex v j {\displaystyle v_{j}} . We can also write this as an ordered pair: e i j := ( v i , v j ) {\displaystyle {\begin{aligned}e_{ij}&:=(v_{i},v_{j})\end{aligned}}} === Community === A community is a unique set of nodes: C i ⊆ V C i ⋂ C j = ∅ ∀ i ≠ j {\displaystyle {\begin{aligned}C_{i}&\subseteq V\\C_{i}&\bigcap C_{j}=\emptyset ~\forall ~i\neq j\end{aligned}}} and the union of all communities must be the total set of vertices: V = ⋃ i = 1 C i {\displaystyle {\begin{aligned}V&=\bigcup _{i=1}C_{i}\end{aligned}}} === Partition === A partition is the set of all communities: P = { C 1 , C 2 , … , C n } {\displaystyle {\begin{aligned}{\mathcal {P}}&=\{C_{1},C_{2},\dots ,C_{n}\}\end{aligned}}} == Partition quality == How communities are partitioned is an integral part on the Leiden algorithm. How partitions are decided can depend on how their quality is measured. Additionally, many of these metrics contain parameters of their own that can change the outcome of their communities. === Modularity === Modularity is a highly used quality metric for assessing how well a set of communities partition a graph. The equation for this metric is defined for an adjacency matrix, A, as: Q = 1 2 m ∑ i j ( A i j − k i k j 2 m ) δ ( c i , c j ) {\displaystyle Q={\frac {1}{2m}}\sum _{ij}(A_{ij}-{\frac {k_{i}k_{j}}{2m}})\delta (c_{i},c_{j})} where: A i j {\displaystyle A_{ij}} represents the edge weight between nodes i {\displaystyle i} and j {\displaystyle j} ; see Adjacency matrix; k i {\displaystyle k_{i}} and k j {\displaystyle k_{j}} are the sum of the weights of the edges attached to nodes i {\displaystyle i} and j {\displaystyle j} , respectively; m {\displaystyle m} is the sum of all of the edge weights in the graph; c i {\displaystyle c_{i}} and c j {\displaystyle c_{j}} are the communities to which the nodes i {\displaystyle i} and j {\displaystyle j} belong; and δ {\displaystyle \delta } is Kronecker delta function: δ ( c i , c j ) = { 1 if c i and c j are the same community 0 otherwise {\displaystyle {\begin{aligned}\delta (c_{i},c_{j})&={\begin{cases}1&{\text{if }}c_{i}{\text{ and }}c_{j}{\text{ are the same community}}\\0&{\text{otherwise}}\end{cases}}\end{aligned}}} === Reichardt Bornholdt Potts Model (RB) === One of the most well used metrics for the Leiden algorithm is the Reichardt Bornholdt Potts Model (RB). This model is used by default in most mainstream Leiden algorithm libraries under the name RBConfigurationVertexPartition. This model introduces a resolution parameter γ {\displaystyle \gamma } and is highly similar to the equation for modularity. This model is defined by the following quality function for an adjacency matrix, A, as: Q = ∑ i j ( A i j − γ k i k j 2 m ) δ ( c i , c j ) {\displaystyle Q=\sum _{ij}(A_{ij}-\gamma {\frac {k_{i}k_{j}}{2m}})\delta (c_{i},c_{j})} where: γ {\displaystyle \gamma } represents a linear resolution parameter === Constant Potts Model (CPM) === Another metric similar to RB is the Constant Potts Model (CPM). This metric also relies on a resolution parameter γ {\displaystyle \gamma } The quality function is defined as: H = − ∑ i j ( A i j w i j − γ ) δ ( c i , c j ) {\displaystyle H=-\sum _{ij}(A_{ij}w_{ij}-\gamma )\delta (c_{i},c_{j})} === Understanding Potts Model resolution parameters/Resolution limit === Typically Potts models such as RB or CPM include a resolution parameter in their calculation. Potts models are introduced as a response to the resolution limit problem that is present in modularity maximization based community detection. The resolution limit problem is that, for some graphs, maximizing modularity may cause substructures of a graph to merge and become a single community and thus smaller structures are lost. These resolution parameters allow modularity adjacent methods to be modified to suit the requirements of the user applying the Leiden algorithm to account for small substructures at a certain granularity. The figure on the right illustrates why resolution can be a helpful parameter when using modularity based quality metrics. In the first graph, modularity only captures the large scale structures of the graph; however, in the second example, a more granular quality metric could potentially detect all substructures in a graph. == Algorithm == The Leiden algorithm starts with a graph of disorganized nodes (a) and sorts it by partitioning them to maximize modularity (the difference in quality between the generated partition and a hypothetical randomized partition of communities). The method it uses is similar to the Louvain algorithm, except that after moving each node it also considers that node's neighbors that are not already in the community it was placed in. This process results in our first partition (b), also referred to as P {\displaystyle {\mathcal {P}}} . Then the algorithm refines this partition by first placing each node into its own individual community and then moving them from one community to another to maximize modularity. It does this iteratively until each node has been visited and moved, and each community has been refined - this creates partition (c), which is the initial partition of P refined {\displaystyle {\mathcal {P}}_{\text{refined}}} . Then an aggregate network (d) is created by turning each community into a node. P refined {\displaystyle {\mathcal {P}}_{\text{refined}}} is used as the basis for the aggregate network while P {\displaystyle {\mathcal {P}}} is used to create its initial partition. Because we use the original partition P {\displaystyle {\mathcal {P}}} in this step, we must retain it so that it can be used in future iterations. These steps together form the first iteration of the algorithm. In subsequent iterations, the nodes of the aggregate network (which each represent a community) are once again placed into their own individual communities and then sorted according to modularity to form a new P refined {\displaystyle {\mathcal {P}}_{\text{refined}}} , forming (e) in the above graphic. In the case depicted by the graph, the nodes were already sorted optimally, so no change too

    Read more →
  • GasBuddy

    GasBuddy

    GasBuddy is a technology company headquartered in Dallas, United States, that offers mobile applications and websites for tracking crowd-sourced locations and prices of gas stations and convenience stores in the United States and Canada. Their platforms offer information sourced from users, gas station operators, and partner companies. They also provide business-to-business services to gas stations and convenience store owners. == History == GasBuddy was founded in Minneapolis in 2000 by Dustin Coupal, Jason Toews as a community website for sharing gas prices. In 2004, they filed as a for-profit corporation in Minnesota under the name GasBuddy Organization Inc. In 2009, GasBuddy launched OpenStore, a platform that allows convenience stores to build and manage their own mobile apps. In 2010, the company launched its own mobile apps that allowed users to input gas prices from their smartphones. In 2013, Oil Price Information Service (OPIS), a subsidiary of UCG, acquired GasBuddy. OPIS is a provider of petroleum pricing and news for businesses. In 2016, IHS acquired OPIS, separating from GasBuddy, which remained with UCG as a subsidiary company. Initially only available in the United States and Canada, GasBuddy launched in Australia in March 2016. Also in that year, GasBuddy released a completely redesigned app, its first major redesign since its release in 2010. GasBuddy also unveiled a new logo and launched GasBuddy Business Pages. GasBuddy shut down the Australian version of their app in 2022. In 2017, GasBuddy launched a gas savings program titled "Pay with GasBuddy" intended to let consumers save at gas stations in the United States. In the same year, GasBuddy was involved in a lawsuit with Reveal Mobile, a location-based marketing company, over the sale of user location data. It was revealed that GasBuddy sold information on more than 4.5 million users to Reveal each month for $9.50 per 1000 users. According to CNET, that information included "users' latitude, longitude, IP address, and time stamps on the data collected," which sparked concern in the media and between its users. In 2021, the GasBuddy app rose to the most popular app on both Android and iPhone platforms in the wake of the Colonial Pipeline ransomware attack PDI acquired GasBuddy in 2021.

    Read more →
  • Ballin' (Mustard and Roddy Ricch song)

    Ballin' (Mustard and Roddy Ricch song)

    "Ballin'" is a song by American record producer Mustard featuring American rapper Roddy Ricch. The track was released as the third single from Mustard's third studio album, Perfect Ten, on August 20, 2019, though it was available as early as the end of June 2019. The song and its accompanying video received acclaim from music critics, with Complex magazine naming it the Best Song of 2019. It peaked at number 11 on the Billboard Hot 100, marking Mustard's highest charting song in the US. The song received a nomination for Best Rap/Sung Performance at the 2020 Grammy Awards, making it the first time Ricch has been nominated for a Grammy and Mustard's first nomination as an artist. Later in 2019, the two released another collaboration, "High Fashion". == Background == Roddy Ricch revealed in an interview that the song was composed in late 2018, but Mustard wanted to keep it for his album, Perfect Ten, which he was still working on. The song was later included on the album, released in June 2019. Ricch said he knew the song was "hard enough" the first time he heard it, while Mustard proclaimed "this is going to be the one". == Composition and lyrics == "Ballin'" has a "rags to riches" theme. In its intro, the song samples girl group 702's 1997 top ten hit "Get It Together". The song features a "smooth, bouncy beat", with Roddy Ricch rapping about his come-up and ascent in the music industry. In the first verse, Ricch salutes fellow Los Angeles rapper, the late Nipsey Hussle and his girlfriend Lauren London: "I run these racks up with my queen like London and Nip". The line simultaneously references Ricch and Hussle's collaboration "Racks in the Middle", released earlier in 2019 as Hussle's last single before his death. Billboard's Heran Mamo noted that "in typical Hussle fashion", Roddy Ricch "narrates his life's hardships before delving into his newfound treasures". == Critical reception == The song was widely acclaimed by music critics. Charles Holmes of Rolling Stone magazine called it "a song of the year contender", while Complex and Billboard both named it as a "standout track" on the album. Pitchfork magazine included "Ballin'" in its list of The Best Rap Songs of 2019 and called it "the centerpiece of Mustard's underappreciated album Perfect Ten". Complex later named it the Best Song of 2019, calling it "a feel-good anthem so infectious you'll need antibiotics just to stop running it back". == Chart performance == "Ballin'" was at the time Mustard's highest charting song in the US, peaking at number 11 on the Billboard Hot 100. It was also Roddy Ricch's highest charting song, until he surpassed it a week later, with the release of his album track "The Box", which eventually reached number 1 on the chart. It reached number one on Billboard's Rhythmic Songs chart, becoming Mustard's second number one following "Pure Water" and Ricch's first number one. The song also topped the Rap Airplay chart. == Music video == The music video for the track was teased by Mustard on his Instagram page on September 29, 2019. The music video for the track was eventually released on October 2, 2019 to critical acclaim. The video features Mustard and Roddy Ricch driving a Lamborghini Aventador in Los Angeles, where they both are from, playing poker in a casino, and going to a strip club. This is contrasted with scenes in which Mustard and Roddy Ricch as children play cards with Monopoly money and playing with miniature toy Lamborghinis together, aspiring for wealth and luxury, representing how they went from "rags to riches". The video also pays tribute to rapper Nipsey Hussle, who had been killed a few months ago. == Live performances == On December 16, 2019, Roddy Ricch performed the song live, alongside an 8-piece orchestra, at Peppermint Club in Los Angeles for Audiomack's Trap Symphony series. Along with Mustard, he performed it at The Pop Out: Ken & Friends on June 19, 2024. == Other uses == The song can be heard on "Elyse's Skit", track 10 off Roddy Ricch's debut album Please Excuse Me for Being Antisocial. In the skit, which is an actual voicenote recording, the mother of a woman named Elyse sends her daughter a voicenote, with "Ballin'" playing in the background, while the mother proceeds to say "I can't get that damn song out my head", jokingly calling it "inappropriate music". Ricch called the skit "something natural". In 2023, AI covers of the song using models based on pop culture characters and real-world celebrities gained viral popularity. == Awards and nominations == 62nd Annual Grammy Awards == Charts == == Certifications ==

    Read more →
  • Applications of artificial intelligence

    Applications of artificial intelligence

    Artificial intelligence is the capability of computational systems to perform tasks that are typically associated with human intelligence, such as learning, reasoning, problem-solving, perception, and decision-making. Artificial intelligence has been used in applications throughout industry and academia. Within the field of Artificial Intelligence, there are multiple subfields. The subfield of machine learning has been used for various scientific and commercial purposes, including language translation, image recognition, decision-making, credit scoring, and e-commerce. In recent years, massive advancements have been made in the field of generative artificial intelligence, which uses generative models to generate text, images, videos, and other forms of data. This article describes applications of AI in different sectors. == Agriculture == In agriculture, AI has been proposed as a way for farmers to identify areas that need irrigation, fertilization, or pesticide treatments to increase yields, thereby improving efficiency. AI has been used to attempt to classify livestock pig call emotions, automate greenhouses, detect diseases and pests, and optimize irrigation. == AI-assisted software develoment == == Architecture and design == == Business == A 2023 study found that generative AI increased productivity by 15% in contact centers. Another 2023 study found it increased productivity by up to 40% in writing tasks. An August 2025 review by MIT found that of surveyed companies, 95% did not report any improvement in revenue from the use of AI. A September 2025 article by the Harvard Business Review describes how increased use of AI does not automatically lead to increases in revenue or actual productivity. Referring to "AI generated work content that masquerades as good work, but lacks the substance to meaningfully advance a given task" the article coins the term workslop. Per studies done in collaboration with the Stanford Social Media Lab, workslop does not improve productivity and undermines trust and collaboration among colleagues. In telehealth, agentic AI is reportedly facilitating the creation of large business models (millions in annual profit) with 1-2 employees, such as MEDVi, which as of August 2025 only had 2 employees and ~$75M in annual profit for GLP-1 weight-loss telehealth services. == Chatbots == == Computer science == === Programming assistance === ==== AI-assisted software development ==== AI can be used for real-time code completion, chat, and automated test generation. These tools are typically integrated with editors and IDEs as plugins. AI-assisted software development systems differ in functionality, quality, speed, and approach to privacy. Creating software primarily via AI is known as "vibe coding". Code created or suggested by AI can be incorrect or inefficient. The use of AI-assisted coding can potentially speed-up software development, but can also slow-down the process by creating more work when debugging and testing. The rush to prematurely adopt AI technology can also incur additional technical debt. AI also requires additional consideration and careful review for cybersecurity, since AI coding software is trained on a wide range of code of inconsistent quality and often replicates poor practices. ==== Neural network design ==== AI can be used to create other AIs. For example, around November 2017, Google's AutoML project to evolve new neural net topologies created NASNet, a system optimized for ImageNet and POCO F1. NASNet's performance exceeded all previously published performance on ImageNet. ==== Quantum computing ==== Research and development of quantum computers has been performed with machine learning algorithms. For example, there is a prototype, photonic, quantum memristive device for neuromorphic computers (NC)/artificial neural networks and NC-using quantum materials with some variety of potential neuromorphic computing-related applications. The use of quantum machine learning for quantum simulators has been proposed for solving physics and chemistry problems. === Historical contributions === AI researchers have created many tools to solve the most difficult problems in computer science. Many of their inventions have been adopted by mainstream computer science and are no longer considered AI. All of the following were originally developed in AI laboratories: Time sharing Interactive interpreters Graphical user interfaces and the computer mouse Rapid application development environments The linked list data structure Automatic storage management Symbolic programming Functional programming Dynamic programming Object-oriented programming Optical character recognition Constraint satisfaction == Customer service == === Human resources === AI programs have been used in hiring processes to screen resumes and rank candidates based on their qualifications, predict a candidate's likelihood of success in a given role, and automate repetitive communication tasks using chatbots. Studies on these programs have identified tendencies for gender bias, favoring male names and male-coded characteristics, as well as bias against disabled candidates and racial minorities. === Online and telephone customer service === AI underlies avatars (automated online assistants) on web pages. It can reduce operation and training costs. Pypestream automated customer service for its mobile application to streamline communication with customers. A Google app analyzes language and converts speech into text. The platform can identify angry customers through their language and respond appropriately. Amazon uses a chatbot for customer service that can perform tasks like checking the status of an order, cancelling orders, offering refunds and connecting the customer with a human representative. Generative AI (GenAI), such as ChatGPT, is increasingly used in business to automate tasks and enhance decision-making. === Hospitality === In the hospitality industry, AI is used to reduce repetitive tasks, analyze trends, interact with guests, and predict customer needs. AI hotel services come in the form of a chatbot, application, virtual voice assistant and service robots. == Education == In educational institutions, AI has been used to automate routine tasks such as attendance tracking, grading, and marking. AI tools have also been used to monitor student progress and analyze learning behaviors, with the goal of facilitating timely interventions for students facing academic challenges. == Energy and environment == === Energy system === The U.S. Department of Energy wrote in an April 2024 report that AI may have applications in modeling power grids, reviewing federal permits with large language models, predicting levels of renewable energy production, and improving the planning process for electrical vehicle charging networks. Other studies have suggested that machine learning can be used for energy consumption prediction and scheduling, e.g. to help with renewable energy intermittency management (see also: smart grid and climate change mitigation in the power grid). === Environmental monitoring === Autonomous ships that monitor the ocean, AI-driven satellite data analysis, passive acoustics or remote sensing and other applications of environmental monitoring make use of machine learning. For example, "Global Plastic Watch" is an AI-based satellite monitoring-platform for analysis/tracking of plastic waste sites to help prevention of plastic pollution – primarily ocean pollution – by helping identify who and where mismanages plastic waste, dumping it into oceans. === Early-warning systems === Machine learning can be used to spot early-warning signs of disasters and environmental issues, possibly including natural pandemics, earthquakes, landslides, heavy rainfall, long-term water supply vulnerability, tipping-points of ecosystem collapse, cyanobacterial bloom outbreaks, and droughts. === Economic and social challenges === The University of Southern California launched the Center for Artificial Intelligence in Society, with the goal of using AI to address problems such as homelessness. Stanford researchers use AI to analyze satellite images to identify high poverty areas. == Entertainment and media == === Media === AI applications analyze media content such as movies, TV programs, advertisement videos or user-generated content. The solutions often involve computer vision. Typical scenarios include the analysis of images using object recognition or face recognition techniques, or the analysis of video for scene recognizing scenes, objects or faces. AI-based media analysis can facilitate media search, the creation of descriptive keywords for content, content policy monitoring (such as verifying the suitability of content for a particular TV viewing time), speech to text for archival or other purposes, and the detection of logos, products or celebrity faces for ad placement. Motion interpolation Pixel-art scaling algorithms Image scaling Imag

    Read more →
  • DONE

    DONE

    The Data-based Online Nonlinear Extremumseeker (DONE) algorithm is a black-box optimization algorithm. DONE models the unknown cost function and attempts to find an optimum of the underlying function. The DONE algorithm is suitable for optimizing costly and noisy functions and does not require derivatives. An advantage of DONE over similar algorithms, such as Bayesian optimization, is that the computational cost per iteration is independent of the number of function evaluations. == Methods == The DONE algorithm was first proposed by Hans Verstraete and Sander Wahls in 2015. The algorithm fits a surrogate model based on random Fourier features and then uses a well-known L-BFGS algorithm to find an optimum of the surrogate model. == Applications == DONE was first demonstrated for maximizing the signal in optical coherence tomography measurements, but has since then been applied to various other applications. For example, it was used to help extending the field of view in light sheet fluorescence microscopy.

    Read more →
  • Reflection (computer graphics)

    Reflection (computer graphics)

    Reflection in computer graphics is used to render reflective objects like mirrors and shiny surfaces. Accurate reflections are commonly computed using ray tracing whereas approximate reflections can usually be computed faster by using simpler methods such as environment mapping. Reflections on shiny surfaces like wood or tile can add to the photorealistic effects of a 3D rendering. == Approaches to reflection rendering == For rendering environment reflections there exist many techniques that differ in precision, computational and implementation complexity. Combination of these techniques are also possible. Image order rendering algorithms based on tracing rays of light, such as ray tracing or path tracing, typically compute accurate reflections on general surfaces, including multiple reflections and self reflections. However these algorithms are generally still too computationally expensive for real time rendering (even though specialized HW exists, such as Nvidia RTX) and require a different rendering approach from typically used rasterization. Reflections on planar surfaces, such as planar mirrors or water surfaces, can be computed simply and accurately in real time with two pass rendering — one for the viewer, one for the view in the mirror, usually with the help of stencil buffer. Some older video games used a trick to achieve this effect with one pass rendering by putting the whole mirrored scene behind a transparent plane representing the mirror. Reflections on non-planar (curved) surfaces are more challenging for real time rendering. Main approaches that are used include: Environment mapping (e.g. cube mapping): a technique that has been widely used e.g. in video games, offering reflection approximation that's mostly sufficient to the eye, but lacking self-reflections and requiring pre-rendering of the environment map. The precision can be increased by using a spatial array of environment maps instead of just one. It is also possible to generate cube map reflections in real time, at the cost of memory and computational requirements. Screen space reflections (SSR): a more expensive technique that traces rays come from pixel data.This requires the data of surface normal and either depth buffer (local space) or position buffer (world space).The disadvantage is that objects not captured in the rendered frame cannot appear in the reflections, which results in unresolved and or false intersections causing artefacts such as reflection vanishment and virtual image. SSR was originally introduced as Real Time Local Reflections in CryENGINE 3. == Types of reflection == Polished - A polished reflection is an undisturbed reflection, like a mirror or chrome surface. Blurry - A blurry reflection means that tiny random bumps, or microfacets, on the surface of the material causes the reflection to be blurry. Metallic - A reflection is metallic if the highlights and reflections retain the color of the reflective object. Glossy - This term can be misused: sometimes, it is a setting which is the opposite of blurry (e.g. when "glossiness" has a low value, the reflection is blurry). Sometimes the term is used as a synonym for "blurred reflection". Glossy used in this context means that the reflection is actually blurred. === Polished or mirror reflection === Mirrors are usually almost 100% reflective. === Metallic reflection === Normal (nonmetallic) objects reflect light and colors in the original color of the object being reflected. Metallic objects reflect lights and colors altered by the color of the metallic object itself. === Blurry reflection === Many materials are imperfect reflectors, where the reflections are blurred to various degrees due to surface roughness that scatters the rays of the reflections. === Glossy reflection === Fully glossy reflection, shows highlights from light sources, but does not show a clear reflection from objects. == Examples of reflections == === Wet floor reflections === The wet floor effect is a graphic effects technique popular in conjunction with Web 2.0 style pages, particularly in logos. The effect can be done manually or created with an auxiliary tool which can be installed to create the effect automatically. Unlike a standard computer reflection (and the Java water effect popular in first-generation web graphics), the wet floor effect involves a gradient and often a slant in the reflection, so that the mirrored image appears to be hovering over or resting on a wet floor.

    Read more →
  • Data (word)

    Data (word)

    The word data is most often used as a singular collective mass noun in educated everyday usage. However, due to the history and etymology of the word, considerable controversy has existed on whether it should be considered a mass noun used with verbs conjugated in the singular, or should be treated as the plural of the now-rarely-used datum. == Usage in English == In one sense, data is the plural form of datum. Datum actually can also be a count noun with the plural datums (see usage in datum article) that can be used with cardinal numbers (e.g., "80 datums"); data (originally a Latin plural) is not used like a normal count noun with cardinal numbers and can be plural with plural determiners such as these and many, or it can be used as a mass noun with a verb in the singular form. Even when a very small quantity of data is referenced (one number, for example), the phrase piece of data is often used, as opposed to datum. The debate over appropriate usage continues, but "data" as a singular form is far more common. In English, the word datum is still used in the general sense of "an item given". In cartography, geography, nuclear magnetic resonance and technical drawing, it is often used to refer to a single specific reference datum from which distances to all other data are measured. Any measurement or result is a datum, though data point is now far more common. Data is indeed most often used as a singular mass noun in educated everyday usage. Some major newspapers, such as The New York Times, use it either in the singular or plural. In The New York Times, the phrases "the survey data are still being analyzed" and "the first year for which data is available" have appeared within one day. The Wall Street Journal explicitly allows this usage in its style guide. The Associated Press style guide classifies data as a collective noun that takes the singular when treated as a unit but the plural when referring to individual items (e.g., "The data is sound" and "The data have been carefully collected"). In scientific writing, data is often treated as a plural, as in These data do not support the conclusions, but the word is also used as a singular mass entity like information (e.g., in computing and related disciplines). British usage now widely accepts treating data as singular in standard English, including everyday newspaper usage at least in non-scientific use. UK scientific publishing still prefers treating it as a plural. Some UK university style guides recommend using data for both singular and plural use, and others recommend treating it only as a singular in connection with computers. The IEEE Computer Society allows usage of data as either a mass noun or plural based on author preference, while IEEE in the editorial style manual indicates to always use the plural form. Some professional organizations and style guides require that authors treat data as a plural noun. For example, the Air Force Flight Test Center once stated that the word data is always plural, never singular.

    Read more →
  • Online analytical processing

    Online analytical processing

    In computing, online analytical processing (OLAP) (), is an approach to quickly answer multi-dimensional analytical (MDA) queries. The term OLAP was created as a slight modification of the traditional database term online transaction processing (OLTP). OLAP is part of the broader category of business intelligence, which also encompasses relational databases, report writing and data mining. Typical applications of OLAP include business reporting for sales, marketing, management reporting, business process management (BPM), budgeting and forecasting, financial reporting and similar areas, with new applications emerging, such as agriculture. OLAP tools enable users to analyse multidimensional data interactively from multiple perspectives. OLAP consists of three basic analytical operations: consolidation (roll-up), drill-down, and slicing and dicing. Consolidation involves the aggregation of data that can be accumulated and computed in one or more dimensions. For example, all sales offices are rolled up to the sales department or sales division to anticipate sales trends. By contrast, the drill-down is a technique that allows users to navigate through the details. For instance, users can view the sales by individual products that make up a region's sales. Slicing and dicing is a feature whereby users can take out (slicing) a specific set of data of the OLAP cube and view (dicing) the slices from different viewpoints. These viewpoints are sometimes called dimensions (such as looking at the same sales by salesperson, or by date, or by customer, or by product, or by region, etc.). Databases configured for OLAP use a multidimensional data model, allowing for complex analytical and ad hoc queries with a rapid execution time. They borrow aspects of navigational databases, hierarchical databases and relational databases. OLAP is typically contrasted to OLTP (online transaction processing), which is generally characterized by much less complex queries, in a larger volume, to process transactions rather than for the purpose of business intelligence or reporting. Whereas OLAP systems are mostly optimized for read, OLTP has to process all kinds of queries (read, insert, update and delete). == Overview of OLAP systems == At the core of any OLAP system is an OLAP cube (also called a 'multidimensional cube' or a hypercube). It consists of numeric facts called measures that are categorized by dimensions. The measures are placed at the intersections of the hypercube, which is spanned by the dimensions as a vector space. The usual interface to manipulate an OLAP cube is a matrix interface, like Pivot tables in a spreadsheet program, which performs projection operations along the dimensions, such as aggregation or averaging. The cube metadata is typically created from a star schema or snowflake schema or fact constellation of tables in a relational database. Measures are derived from the records in the fact table and dimensions are derived from the dimension tables. Each measure can be thought of as having a set of labels, or meta-data associated with it. A dimension is what describes these labels; it provides information about the measure. A simple example would be a cube that contains a store's sales as a measure, and Date/Time as a dimension. Each Sale has a Date/Time label that describes more about that sale. For example: Sales Fact Table +-------------+----------+ | sale_amount | time_id | +-------------+----------+ Time Dimension | 930.10| 1234 |----+ +---------+-------------------+ +-------------+----------+ | | time_id | timestamp | | +---------+-------------------+ +---->| 1234 | 20080902 12:35:43 | +---------+-------------------+ === Multidimensional databases === Multidimensional structure is defined as "a variation of the relational model that uses multidimensional structures to organize data and express the relationships between data". The structure is broken into cubes and the cubes are able to store and access data within the confines of each cube. "Each cell within a multidimensional structure contains aggregated data related to elements along each of its dimensions". Even when data is manipulated it remains easy to access and continues to constitute a compact database format. The data still remains interrelated. Multidimensional structure is quite popular for analytical databases that use online analytical processing (OLAP) applications. Analytical databases use these databases because of their ability to deliver answers to complex business queries swiftly. Data can be viewed from different angles, which gives a broader perspective of a problem unlike other models. === Aggregations === It has been claimed that for complex queries OLAP cubes can produce an answer in around 0.1% of the time required for the same query on OLTP relational data. The most important mechanism in OLAP which allows it to achieve such performance is the use of aggregations. Aggregations are built from the fact table by changing the granularity on specific dimensions and aggregating up data along these dimensions, using an aggregate function (or aggregation function). The number of possible aggregations is determined by every possible combination of dimension granularities. The combination of all possible aggregations and the base data contains the answers to every query which can be answered from the data. Because usually there are many aggregations that can be calculated, often only a predetermined number are fully calculated; the remainder are solved on demand. The problem of deciding which aggregations (views) to calculate is known as the view selection problem. View selection can be constrained by the total size of the selected set of aggregations, the time to update them from changes in the base data, or both. The objective of view selection is typically to minimize the average time to answer OLAP queries, although some studies also minimize the update time. View selection is NP-complete. Many approaches to the problem have been explored, including greedy algorithms, randomized search, genetic algorithms and A search algorithm. Some aggregation functions can be computed for the entire OLAP cube by precomputing values for each cell, and then computing the aggregation for a roll-up of cells by aggregating these aggregates, applying a divide and conquer algorithm to the multidimensional problem to compute them efficiently. For example, the overall sum of a roll-up is just the sum of the sub-sums in each cell. Functions that can be decomposed in this way are called decomposable aggregation functions, and include COUNT, MAX, MIN, and SUM, which can be computed for each cell and then directly aggregated; these are known as self-decomposable aggregation functions. In other cases, the aggregate function can be computed by computing auxiliary numbers for cells, aggregating these auxiliary numbers, and finally computing the overall number at the end; examples include AVERAGE (tracking sum and count, dividing at the end) and RANGE (tracking max and min, subtracting at the end). In other cases, the aggregate function cannot be computed without analyzing the entire set at once, though in some cases approximations can be computed; examples include DISTINCT COUNT, MEDIAN, and MODE; for example, the median of a set is not the median of medians of subsets. These latter are difficult to implement efficiently in OLAP, as they require computing the aggregate function on the base data, either computing them online (slow) or precomputing them for possible rollouts (large space). == Types == OLAP systems have been traditionally categorized using the following taxonomy. === Multidimensional OLAP (MOLAP) === MOLAP (multi-dimensional online analytical processing) is the classic form of OLAP and is sometimes referred to as just OLAP. MOLAP stores this data in an optimized multi-dimensional array storage, rather than in a relational database. Some MOLAP tools require the pre-computation and storage of derived data, such as consolidations – the operation known as processing. Such MOLAP tools generally utilize a pre-calculated data set referred to as a data cube. The data cube contains all the possible answers to a given range of questions. As a result, they have a very fast response to queries. On the other hand, updating can take a long time depending on the degree of pre-computation. Pre-computation can also lead to what is known as data explosion. Other MOLAP tools, particularly those that implement the functional database model do not pre-compute derived data but make all calculations on demand other than those that were previously requested and stored in a cache. Advantages of MOLAP Fast query performance due to optimized storage, multidimensional indexing and caching. Smaller on-disk size of data compared to data stored in relational database due to compression techniques. Automated computation of higher-level aggregates of the data. It is very compact for low dimension data se

    Read more →