Online machine learning

Online machine learning

In computer science, online machine learning is a method of machine learning in which data becomes available in a sequential order and is used to update the best predictor for future data at each step, as opposed to batch learning techniques which generate the best predictor by learning on the entire training data set at once. Online learning is a common technique used in areas of machine learning where it is computationally infeasible to train over the entire dataset, requiring the need of out-of-core algorithms. It is also used in situations where it is necessary for the algorithm to dynamically adapt to new patterns in the data, or when the data itself is generated as a function of time, e.g., prediction of prices in the financial international markets. Online learning algorithms may be prone to catastrophic interference, a problem that can be addressed by incremental learning approaches. Online machine learning algorithms find applications in a wide variety of fields such as sponsored search to maximize ad revenue, portfolio optimization, shortest path prediction (with stochastic weights, e.g. traffic on roads for a maps application), spam filtering, real-time fraud detection, dynamic pricing for e-commerce, etc. There is also growing interest in usage of online learning paradigms for LLMs to enable continuous, real-time adaptation after the initial training. == Introduction == In the setting of supervised learning, a function of f : X → Y {\displaystyle f:X\to Y} is to be learned, where X {\displaystyle X} is thought of as a space of inputs and Y {\displaystyle Y} as a space of outputs, that predicts well on instances that are drawn from a joint probability distribution p ( x , y ) {\displaystyle p(x,y)} on X × Y {\displaystyle X\times Y} . In reality, the learner never knows the true distribution p ( x , y ) {\displaystyle p(x,y)} over instances. Instead, the learner usually has access to a training set of examples ( x 1 , y 1 ) , … , ( x n , y n ) {\displaystyle (x_{1},y_{1}),\ldots ,(x_{n},y_{n})} . In this setting, the loss function is given as V : Y × Y → R {\displaystyle V:Y\times Y\to \mathbb {R} } , such that V ( f ( x ) , y ) {\displaystyle V(f(x),y)} measures the difference between the predicted value f ( x ) {\displaystyle f(x)} and the true value y {\displaystyle y} . The ideal goal is to select a function f ∈ H {\displaystyle f\in {\mathcal {H}}} , where H {\displaystyle {\mathcal {H}}} is a space of functions called a hypothesis space, so that some notion of total loss is minimized. Depending on the type of model (statistical or adversarial), one can devise different notions of loss, which lead to different learning algorithms. == Statistical view of online learning == In statistical learning models, the training sample ( x i , y i ) {\displaystyle (x_{i},y_{i})} are assumed to have been drawn from the true distribution p ( x , y ) {\displaystyle p(x,y)} and the objective is to minimize the expected "risk" I [ f ] = E [ V ( f ( x ) , y ) ] = ∫ V ( f ( x ) , y ) d p ( x , y ) . {\displaystyle I[f]=\mathbb {E} [V(f(x),y)]=\int V(f(x),y)\,dp(x,y)\ .} A common paradigm in this situation is to estimate a function f ^ {\displaystyle {\hat {f}}} through empirical risk minimization or regularized empirical risk minimization (usually Tikhonov regularization). The choice of loss function here gives rise to several well-known learning algorithms such as regularized least squares and support vector machines. A purely online model in this category would learn based on just the new input ( x t + 1 , y t + 1 ) {\displaystyle (x_{t+1},y_{t+1})} , the current best predictor f t {\displaystyle f_{t}} and some extra stored information (which is usually expected to have storage requirements independent of training data size). For many formulations, for example nonlinear kernel methods, true online learning is not possible, though a form of hybrid online learning with recursive algorithms can be used where f t + 1 {\displaystyle f_{t+1}} is permitted to depend on f t {\displaystyle f_{t}} and all previous data points ( x 1 , y 1 ) , … , ( x t , y t ) {\displaystyle (x_{1},y_{1}),\ldots ,(x_{t},y_{t})} . In this case, the space requirements are no longer guaranteed to be constant since it requires storing all previous data points, but the solution may take less time to compute with the addition of a new data point, as compared to batch learning techniques. A common strategy to overcome the above issues is to learn using mini-batches, which process a small batch of b ≥ 1 {\displaystyle b\geq 1} data points at a time, this can be considered as pseudo-online learning for b {\displaystyle b} much smaller than the total number of training points. Mini-batch techniques are used with repeated passing over the training data to obtain optimized out-of-core versions of machine learning algorithms, for example, stochastic gradient descent. When combined with backpropagation, this is currently the de facto training method for training artificial neural networks. === Example: linear least squares === The simple example of linear least squares is used to explain a variety of ideas in online learning. The ideas are general enough to be applied to other settings, for example, with other convex loss functions. === Batch learning === Consider the setting of supervised learning with f {\displaystyle f} being a linear function to be learned: f ( x j ) = ⟨ w , x j ⟩ = w ⋅ x j {\displaystyle f(x_{j})=\langle w,x_{j}\rangle =w\cdot x_{j}} where x j ∈ R d {\displaystyle x_{j}\in \mathbb {R} ^{d}} is a vector of inputs (data points) and w ∈ R d {\displaystyle w\in \mathbb {R} ^{d}} is a linear filter vector. The goal is to compute the filter vector w {\displaystyle w} . To this end, a square loss function V ( f ( x j ) , y j ) = ( f ( x j ) − y j ) 2 = ( ⟨ w , x j ⟩ − y j ) 2 {\displaystyle V(f(x_{j}),y_{j})=(f(x_{j})-y_{j})^{2}=(\langle w,x_{j}\rangle -y_{j})^{2}} is used to compute the vector w {\displaystyle w} that minimizes the empirical loss I n [ w ] = ∑ j = 1 n V ( ⟨ w , x j ⟩ , y j ) = ∑ j = 1 n ( x j T w − y j ) 2 {\displaystyle I_{n}[w]=\sum _{j=1}^{n}V(\langle w,x_{j}\rangle ,y_{j})=\sum _{j=1}^{n}(x_{j}^{\mathsf {T}}w-y_{j})^{2}} where y j ∈ R . {\displaystyle y_{j}\in \mathbb {R} .} Let X {\displaystyle X} be the i × d {\displaystyle i\times d} data matrix and y ∈ R i {\displaystyle y\in \mathbb {R} ^{i}} is the column vector of target values after the arrival of the first i {\displaystyle i} data points. Assuming that the covariance matrix Σ i = X T X {\displaystyle \Sigma _{i}=X^{\mathsf {T}}X} is invertible (otherwise it is preferential to proceed in a similar fashion with Tikhonov regularization), the best solution f ∗ ( x ) = ⟨ w ∗ , x ⟩ {\displaystyle f^{}(x)=\langle w^{},x\rangle } to the linear least squares problem is given by w ∗ = ( X T X ) − 1 X T y = Σ i − 1 ∑ j = 1 i x j y j . {\displaystyle w^{}=(X^{\mathsf {T}}X)^{-1}X^{\mathsf {T}}y=\Sigma _{i}^{-1}\sum _{j=1}^{i}x_{j}y_{j}.} Now, calculating the covariance matrix Σ i = ∑ j = 1 i x j x j T {\displaystyle \Sigma _{i}=\sum _{j=1}^{i}x_{j}x_{j}^{\mathsf {T}}} takes time O ( i d 2 ) {\displaystyle O(id^{2})} , inverting the d × d {\displaystyle d\times d} matrix takes time O ( d 3 ) {\displaystyle O(d^{3})} , while the rest of the multiplication takes time O ( d 2 ) {\displaystyle O(d^{2})} , giving a total time of O ( i d 2 + d 3 ) {\displaystyle O(id^{2}+d^{3})} . When there are n {\displaystyle n} total points in the dataset, to recompute the solution after the arrival of every datapoint i = 1 , … , n {\displaystyle i=1,\ldots ,n} , the naive approach will have a total complexity O ( n 2 d 2 + n d 3 ) {\displaystyle O(n^{2}d^{2}+nd^{3})} . Note that when storing the matrix Σ i {\displaystyle \Sigma _{i}} , then updating it at each step needs only adding x i + 1 x i + 1 T {\displaystyle x_{i+1}x_{i+1}^{\mathsf {T}}} , which takes O ( d 2 ) {\displaystyle O(d^{2})} time, reducing the total time to O ( n d 2 + n d 3 ) = O ( n d 3 ) {\displaystyle O(nd^{2}+nd^{3})=O(nd^{3})} , but with an additional storage space of O ( d 2 ) {\displaystyle O(d^{2})} to store Σ i {\displaystyle \Sigma _{i}} . === Online learning: recursive least squares === The recursive least squares (RLS) algorithm considers an online approach to the least squares problem. It can be shown that by initialising w 0 = 0 ∈ R d {\displaystyle \textstyle w_{0}=0\in \mathbb {R} ^{d}} and Γ 0 = I ∈ R d × d {\displaystyle \textstyle \Gamma _{0}=I\in \mathbb {R} ^{d\times d}} , the solution of the linear least squares problem given in the previous section can be computed by the following iteration: Γ i = Γ i − 1 − Γ i − 1 x i x i T Γ i − 1 1 + x i T Γ i − 1 x i {\displaystyle \Gamma _{i}=\Gamma _{i-1}-{\frac {\Gamma _{i-1}x_{i}x_{i}^{\mathsf {T}}\Gamma _{i-1}}{1+x_{i}^{\mathsf {T}}\Gamma _{i-1}x_{i}}}} w i = w i − 1 − Γ i x i ( x i T w i − 1 − y i ) {\displaystyle w_{i}=w_{i-1}-\Gamma _{i}x_{i}\left(x_{i}^{\mathsf {T}}w_{

Event store

An event store is a type of database optimized for storage of events. Conceptually, an event store records only the events affecting an entity, dossier, or policy, and the state of the entity at any point in its history can be reconstructed by replaying its contributing events in sequential order. Events (and their corresponding data) are the only "real" facts that should be stored in the database. All other objects can be derived from these events, meaning they are instantiated in memory by runtime code as needed (e.g. for showing in a user interface). In theory, any object that aggregates over recorded event data is not stored in the database. Instead these objects are built 'on the fly', by traversing the event history. When the aggregated object instance is no longer needed, it can simply be discarded (released from memory). == Example with insurance policies == For example, the event store concept of a database can be applied to insurance policies or pension dossiers. In these policies or dossiers the instantiation of each object that make up the dossier or policy (the person, partner(s), employments, etc.) can be derived and can be instantiated in memory based on the real world events. == Double timeline == A crucial part of an event store database is that each event has a double timeline: This enables event stores to correct errors of events that have been entered into the event store database before. The two dates are: Valid date is the date at which the event has become valid. Transaction date is the date at which the event is entered into the database. == Error correction == Another crucial part of an event store database is that events that are stored are not allowed to be changed. Once stored, also erroneous events are not changed anymore. The only way to change (or better: correct) these events is to instantiate a new event with the new values and using the double timeline. A correcting event would have the new values of the original event, with an event data of that corrected event, but a different transaction date. This mechanism ensures reproducibility at each moment in the time, even in the time period before the correction has taken place. It also allows to reproduce situations based on erroneous events (if required). == Advantages and disadvantages == One advantage of the event store concept is that handling the effects of back dated events (events that take effect before previous events and that may even invalidate them) is much easier. An event store will simplify the code in that rolling back erroneous situations and rolling up the new, correct situations is not needed anymore. Disadvantage may be that the code needs to re-instantiate all objects in memory based on the events each time a service call is received for a specific dossier or policy. == Compared to regular databases == In regular databases, handling backdated events to correct previous, erroneous events can be painful as it often results in rolling back all previous, erroneous transactions and objects and rolling up the new, correct transactions and objects. In an event store, only the new event (and its corresponding facts) are stored. The code will then redetermine the transactions and objects based on the new facts in memory.

ID3 algorithm

In decision tree learning, ID3 (Iterative Dichotomiser 3) is a greedy algorithm invented by Ross Quinlan used to generate a decision tree from a dataset. ID3 is the precursor to the C4.5 algorithm. The 3 in the name is meant to signify that this was Quinlan's third attempt at a model based on entropy-based splitting, and the term dichotimser is a misnomer as it implies a binary split, but the ID3 algorithm can split on multi-valued attributes. == Algorithm == The ID3 algorithm begins with the original set S {\displaystyle S} as the root node. On each iteration of the algorithm, it iterates through every unused attribute of the set S {\displaystyle S} and calculates the entropy H ( S ) {\displaystyle \mathrm {H} {(S)}} or the information gain I G ( S ) {\displaystyle IG(S)} of that attribute. It then selects the attribute which has the smallest entropy (or largest information gain) value. The set S {\displaystyle S} is then split or partitioned by the selected attribute to produce subsets of the data. (For example, a node can be split into child nodes based upon the subsets of the population whose ages are less than 50, between 50 and 100, and greater than 100.) The algorithm continues to recurse on each subset, considering only attributes never selected before. Recursion on a subset may stop in one of these cases: every element in the subset belongs to the same class; in which case the node is turned into a leaf node and labelled with the class of the examples. there are no more attributes to be selected, but the examples still do not belong to the same class. In this case, the node is made a leaf node and labelled with the most common class of the examples in the subset. there are no examples in the subset, which happens when no example in the parent set was found to match a specific value of the selected attribute. An example could be the absence of a person among the population with age over 100 years. Then a leaf node is created and labelled with the most common class of the examples in the parent node's set. Throughout the algorithm, the decision tree is constructed with each non-terminal node (internal node) representing the selected attribute on which the data was split, and terminal nodes (leaf nodes) representing the class label of the final subset of this branch. === Summary === Calculate the entropy of every attribute a {\displaystyle a} of the data set S {\displaystyle S} . Partition ("split") the set S {\displaystyle S} into subsets using the attribute for which the resulting entropy after splitting is minimized; or, equivalently, information gain is maximum. Make a decision tree node containing that attribute. Recurse on subsets using the remaining attributes. === Properties === ID3 does not guarantee an optimal solution. It can converge upon local optima. It uses a greedy strategy by selecting the locally best attribute to split the dataset on each iteration. The algorithm's optimality can be improved by using backtracking during the search for the optimal decision tree at the cost of possibly taking longer. ID3 can overfit the training data. To avoid overfitting, smaller decision trees should be preferred over larger ones. This algorithm usually produces small trees, but it does not always produce the smallest possible decision tree. ID3 is harder to use on continuous data than on factored data (factored data has a discrete number of possible values, thus reducing the possible branch points). If the values of any given attribute are continuous, then there are many more places to split the data on this attribute, and searching for the best value to split by can be time-consuming. === Usage === The ID3 algorithm is used by training on a data set S {\displaystyle S} to produce a decision tree which is stored in memory. At runtime, this decision tree is used to classify new test cases (feature vectors) by traversing the decision tree using the features of the datum to arrive at a leaf node. == The ID3 metrics == === Entropy === Entropy H ( S ) {\displaystyle \mathrm {H} {(S)}} is a measure of the amount of uncertainty in the (data) set S {\displaystyle S} (i.e. entropy characterizes the (data) set S {\displaystyle S} ). H ( S ) = ∑ x ∈ X − p ( x ) log 2 ⁡ p ( x ) {\displaystyle \mathrm {H} {(S)}=\sum _{x\in X}{-p(x)\log _{2}p(x)}} Where, S {\displaystyle S} – The current dataset for which entropy is being calculated This changes at each step of the ID3 algorithm, either to a subset of the previous set in the case of splitting on an attribute or to a "sibling" partition of the parent in case the recursion terminated previously. X {\displaystyle X} – The set of classes in S {\displaystyle S} p ( x ) {\displaystyle p(x)} – The proportion of the number of elements in class x {\displaystyle x} to the number of elements in set S {\displaystyle S} When H ( S ) = 0 {\displaystyle \mathrm {H} {(S)}=0} , the set S {\displaystyle S} is perfectly classified (i.e. all elements in S {\displaystyle S} are of the same class). In ID3, entropy is calculated for each remaining attribute. The attribute with the smallest entropy is used to split the set S {\displaystyle S} on this iteration. Entropy in information theory measures how much information is expected to be gained upon measuring a random variable; as such, it can also be used to quantify the amount to which the distribution of the quantity's values is unknown. A constant quantity has zero entropy, as its distribution is perfectly known. In contrast, a uniformly distributed random variable (discretely or continuously uniform) maximizes entropy. Therefore, the greater the entropy at a node, the less information is known about the classification of data at this stage of the tree; and therefore, the greater the potential to improve the classification here. As such, ID3 is a greedy heuristic performing a best-first search for locally optimal entropy values. Its accuracy can be improved by preprocessing the data. === Information gain === Information gain I G ( A ) {\displaystyle IG(A)} is the measure of the difference in entropy from before to after the set S {\displaystyle S} is split on an attribute A {\displaystyle A} . In other words, how much uncertainty in S {\displaystyle S} was reduced after splitting set S {\displaystyle S} on attribute A {\displaystyle A} . I G ( S , A ) = H ( S ) − ∑ t ∈ T p ( t ) H ( t ) = H ( S ) − H ( S | A ) . {\displaystyle IG(S,A)=\mathrm {H} {(S)}-\sum _{t\in T}p(t)\mathrm {H} {(t)}=\mathrm {H} {(S)}-\mathrm {H} {(S|A)}.} Where, H ( S ) {\displaystyle \mathrm {H} (S)} – Entropy of set S {\displaystyle S} T {\displaystyle T} – The subsets created from splitting set S {\displaystyle S} by attribute A {\displaystyle A} such that S = ⋃ t ∈ T t {\displaystyle S=\bigcup _{t\in T}t} p ( t ) {\displaystyle p(t)} – The proportion of the number of elements in t {\displaystyle t} to the number of elements in set S {\displaystyle S} H ( t ) {\displaystyle \mathrm {H} (t)} – Entropy of subset t {\displaystyle t} In ID3, information gain can be calculated (instead of entropy) for each remaining attribute. The attribute with the largest information gain is used to split the set S {\displaystyle S} on this iteration.

Generalized blockmodeling of binary networks

Generalized blockmodeling of binary networks (also relational blockmodeling) is an approach of generalized blockmodeling, analysing the binary network(s). As most network analyses deal with binary networks, this approach is also considered as the fundamental approach of blockmodeling. This is especially noted, as the set of ideal blocks, when used for interpretation of blockmodels, have binary link patterns, which precludes them to be compared with valued empirical blocks. When analysing the binary networks, the criterion function is measuring block inconsistencies, while also reporting the possible errors. The ideal block in binary blockmodeling has only three types of conditions: "a certain cell must be (at least) 1, a certain cell must be 0 and the f {\displaystyle f} over each row (or column) must be at least 1". It is also used as a basis for developing the generalized blockmodeling of valued networks.

Multiple discriminant analysis

Multiple Discriminant Analysis (MDA) is a multivariate dimensionality reduction technique. It has been used to predict signals as diverse as neural memory traces and corporate failure. MDA is not directly used to perform classification. It merely supports classification by yielding a compressed signal amenable to classification. The method described in Duda et al. (2001) §3.8.3 projects the multivariate signal down to an M−1 dimensional space where M is the number of categories. MDA is useful because most classifiers are strongly affected by the curse of dimensionality. In other words, when signals are represented in very-high-dimensional spaces, the classifier's performance is catastrophically impaired by the overfitting problem. This problem is reduced by compressing the signal down to a lower-dimensional space as MDA does. MDA has been used to reveal neural codes.

Distributed file system for cloud

A distributed file system for cloud is a file system that allows many clients to have access to data and supports operations (create, delete, modify, read, write) on that data. Each data file may be partitioned into several parts called chunks. Each chunk may be stored on different remote machines, facilitating the parallel execution of applications. Typically, data is stored in files in a hierarchical tree, where the nodes represent directories. There are several ways to share files in a distributed architecture: each solution must be suitable for a certain type of application, depending on how complex the application is. Meanwhile, the security of the system must be ensured. Confidentiality, availability and integrity are the main keys for a secure system. Users can share computing resources through the Internet thanks to cloud computing which is typically characterized by scalable and elastic resources – such as physical servers, applications and any services that are virtualized and allocated dynamically. Synchronization is required to make sure that all devices are up-to-date. Distributed file systems enable many big, medium, and small enterprises to store and access their remote data as they do local data, facilitating the use of variable resources. == Overview == === History === Today, there are many implementations of distributed file systems. The first file servers were developed by researchers in the 1970s. Sun Microsystem's Network File System became available in the 1980s. Before that, people who wanted to share files used the sneakernet method, physically transporting files on storage media from place to place. Once computer networks started to proliferate, it became obvious that the existing file systems had many limitations and were unsuitable for multi-user environments. Users initially used FTP to share files. FTP first ran on the PDP-10 at the end of 1973. Even with FTP, files needed to be copied from the source computer onto a server and then from the server onto the destination computer. Users were required to know the physical addresses of all computers involved with the file sharing. === Supporting techniques === Modern data centers must support large, heterogenous environments, consisting of large numbers of computers of varying capacities. Cloud computing coordinates the operation of all such systems, with techniques such as data center networking (DCN), the MapReduce framework, which supports data-intensive computing applications in parallel and distributed systems, and virtualization techniques that provide dynamic resource allocation, allowing multiple operating systems to coexist on the same physical server. === Applications === Cloud computing provides large-scale computing thanks to its ability to provide the needed CPU and storage resources to the user with complete transparency. This makes cloud computing particularly suited to support different types of applications that require large-scale distributed processing. This data-intensive computing needs a high performance file system that can share data between virtual machines (VM). Cloud computing dynamically allocates the needed resources, releasing them once a task is finished, requiring users to pay only for needed services, often via a service-level agreement. Cloud computing and cluster computing paradigms are becoming increasingly important to industrial data processing and scientific applications such as astronomy and physics, which frequently require the availability of large numbers of computers to carry out experiments. == Architectures == Most distributed file systems are built on the client-server architecture, but other, decentralized, solutions exist as well. === Client-server architecture === Network File System (NFS) uses a client-server architecture, which allows sharing of files between a number of machines on a network as if they were located locally, providing a standardized view. The NFS protocol allows heterogeneous clients' processes, probably running on different machines and under different operating systems, to access files on a distant server, ignoring the actual location of files. Relying on a single server results in the NFS protocol suffering from potentially low availability and poor scalability. Using multiple servers does not solve the availability problem since each server is working independently. The model of NFS is a remote file service. This model is also called the remote access model, which is in contrast with the upload/download model: Remote access model: Provides transparency, the client has access to a file. He sends requests to the remote file (while the file remains on the server). Upload/download model: The client can access the file only locally. It means that the client has to download the file, make modifications, and upload it again, to be used by others' clients. The file system used by NFS is almost the same as the one used by Unix systems. Files are hierarchically organized into a naming graph in which directories and files are represented by nodes. === Cluster-based architectures === A cluster-based architecture ameliorates some of the issues in client-server architectures, improving the execution of applications in parallel. The technique used here is file-striping: a file is split into multiple chunks, which are "striped" across several storage servers. The goal is to allow access to different parts of a file in parallel. If the application does not benefit from this technique, then it would be more convenient to store different files on different servers. However, when it comes to organizing a distributed file system for large data centers, such as Amazon and Google, that offer services to web clients allowing multiple operations (reading, updating, deleting,...) to a large number of files distributed among a large number of computers, then cluster-based solutions become more beneficial. Note that having a large number of computers may mean more hardware failures. Two of the most widely used distributed file systems (DFS) of this type are the Google File System (GFS) and the Hadoop Distributed File System (HDFS). The file systems of both are implemented by user level processes running on top of a standard operating system (Linux in the case of GFS). ==== Design principles ==== ===== Goals ===== Google File System (GFS) and Hadoop Distributed File System (HDFS) are specifically built for handling batch processing on very large data sets. For that, the following hypotheses must be taken into account: High availability: the cluster can contain thousands of file servers and some of them can be down at any time A server belongs to a rack, a room, a data center, a country, and a continent, in order to precisely identify its geographical location The size of a file can vary from many gigabytes to many terabytes. The file system should be able to support a massive number of files The need to support append operations and allow file contents to be visible even while a file is being written Communication is reliable among working machines: TCP/IP is used with a remote procedure call RPC communication abstraction. TCP allows the client to know almost immediately when there is a problem and a need to make a new connection. ===== Load balancing ===== Load balancing is essential for efficient operation in distributed environments. It means distributing work among different servers, fairly, in order to get more work done in the same amount of time and to serve clients faster. In a system containing N chunkservers in a cloud (N being 1000, 10000, or more), where a certain number of files are stored, each file is split into several parts or chunks of fixed size (for example, 64 megabytes), the load of each chunkserver being proportional to the number of chunks hosted by the server. In a load-balanced cloud, resources can be efficiently used while maximizing the performance of MapReduce-based applications. ===== Load rebalancing ===== In a cloud computing environment, failure is the norm, and chunkservers may be upgraded, replaced, and added to the system. Files can also be dynamically created, deleted, and appended. That leads to load imbalance in a distributed file system, meaning that the file chunks are not distributed equitably between the servers. Distributed file systems in clouds such as GFS and HDFS rely on central or master servers or nodes (Master for GFS and NameNode for HDFS) to manage the metadata and the load balancing. The master rebalances replicas periodically: data must be moved from one DataNode/chunkserver to another if free space on the first server falls below a certain threshold. However, this centralized approach can become a bottleneck for those master servers, if they become unable to manage a large number of file accesses, as it increases their already heavy loads. The load rebalance problem is NP-hard. In order to get a large number of chunkservers to work in collaboration, and to

Latent class model

In statistics, a latent class model (LCM) is a model for clustering multivariate discrete data. It assumes that the data arise from a mixture of discrete distributions, within each of which the variables are independent. It is called a latent class model because the class to which each data point belongs is unobserved (or latent). Latent class analysis (LCA) is a subset of structural equation modeling used to find groups or subtypes of cases in multivariate categorical data. These groups or subtypes of cases are called "latent classes". When faced with the following situation, a researcher might opt to use LCA to better understand the data: Symptoms a, b, c, and d have been recorded in a variety of patients diagnosed with diseases X, Y, and Z. Disease X is associated with symptoms a, b, and c; disease Y is linked to symptoms b, c, and d; and disease Z is connected to symptoms a, c, and d. In this context, the LCA would attempt to detect the presence of latent classes (i.e., the disease entities), thus creating patterns of association in the symptoms. As in factor analysis, LCA can also be used to classify cases according to their maximum likelihood class membership probability. The key criterion for resolving the LCA is identifying latent classes in which the observed symptom associations are effectively rendered null. This is because within each class, the diseases responsible for the symptoms create a structure of dependencies. As a result, the symptoms become conditionally independent, meaning that, given the class a case belongs to, the symptoms are no longer related to one another. == Model == Within each latent class, the observed variables are statistically independent—an essential aspect of latent class modeling. Usually, the observed variables are statistically dependent. By introducing the latent variable, independence is restored in the sense that within classes, variables are independent (local independence). Therefore, the association between the observed variables is explained by the classes of the latent variable (McCutcheon, 1987). In one form, the LCM is written as p i 1 , i 2 , … , i N ≈ ∑ t T p t ∏ n N p i n , t n , {\displaystyle p_{i_{1},i_{2},\ldots ,i_{N}}\approx \sum _{t}^{T}p_{t}\,\prod _{n}^{N}p_{i_{n},t}^{n},} where T {\displaystyle T} is the number of latent classes and p t {\displaystyle p_{t}} are the so-called recruitment or unconditional probabilities that should sum to one. p i n , t n {\displaystyle p_{i_{n},t}^{n}} are the marginal or conditional probabilities. For a two-way latent class model, the form is p i j ≈ ∑ t T p t p i t p j t . {\displaystyle p_{ij}\approx \sum _{t}^{T}p_{t}\,p_{it}\,p_{jt}.} This two-way model is related to probabilistic latent semantic analysis and non-negative matrix factorization. The probability model used in LCA is closely related to the Naive Bayes classifier. The main difference is that in LCA, the class membership of an individual is a latent variable, whereas in Naive Bayes classifiers, the class membership is an observed label. == Related methods == There are a number of methods with distinct names and uses that share a common relationship. Cluster analysis is, like LCA, used to discover taxon-like groups of cases in data. Multivariate mixture estimation (MME) is applicable to continuous data and assumes that such data arise from a mixture of distributions, such as a set of heights arising from a mixture of men and women. If a multivariate mixture estimation is constrained so that measures must be uncorrelated within each distribution, it is termed latent profile analysis. Modified to handle discrete data, this constrained analysis is known as LCA. Discrete latent trait models further constrain the classes to form from segments of a single dimension, allocating members to classes based on that dimension. An example would be assigning cases to social classes based on ability or merit. In a practical instance, the variables could be multiple choice items of a political questionnaire. In this case, the data consists of an N-way contingency table with answers to the items for a number of respondents. In this example, the latent variable refers to political opinion, and the latent classes to political groups. Given group membership, the conditional probabilities specify the chance that certain answers are chosen. == Application == LCA may be used in many fields, such as: collaborative filtering, Behavior Genetics and Evaluation of diagnostic tests.