A data stream management system (DSMS) is a computer software system to manage continuous data streams. It is similar to a database management system (DBMS), which is, however, designed for static data in conventional databases. A DBMS also offers a flexible query processing so that the information needed can be expressed using queries. However, in contrast to a DBMS, a DSMS executes a continuous query that is not only performed once, but is permanently installed. Therefore, the query is continuously executed until it is explicitly uninstalled. Since most DSMS are data-driven, a continuous query produces new results as long as new data arrive at the system. This basic concept is similar to complex event processing so that both technologies are partially coalescing. == Functional principle == One important feature of a DSMS is the possibility to handle potentially infinite and rapidly changing data streams by offering flexible processing at the same time, although there are only limited resources such as main memory. The following table provides various principles of DSMS and compares them to traditional DBMS. == Processing and streaming models == One of the biggest challenges for a DSMS is to handle potentially infinite data streams using a fixed amount of memory and no random access to the data. There are different approaches to limit the amount of data in one pass, which can be divided into two classes. For the one hand, there are compression techniques that try to summarize the data and for the other hand there are window techniques that try to portion the data into (finite) parts. === Synopses === The idea behind compression techniques is to maintain only a synopsis of the data, but not all (raw) data points of the data stream. The algorithms range from selecting random data points called sampling to summarization using histograms, wavelets or sketching. One simple example of a compression is the continuous calculation of an average. Instead of memorizing each data point, the synopsis only holds the sum and the number of items. The average can be calculated by dividing the sum by the number. However, it should be mentioned that synopses cannot reflect the data accurately. Thus, a processing that is based on synopses may produce inaccurate results. === Windows === Instead of using synopses to compress the characteristics of the whole data streams, window techniques only look on a portion of the data. This approach is motivated by the idea that only the most recent data are relevant. Therefore, a window continuously cuts out a part of the data stream, e.g. the last ten data stream elements, and only considers these elements during the processing. There are different kinds of such windows like sliding windows that are similar to FIFO lists or tumbling windows that cut out disjoint parts. Furthermore, the windows can also be differentiated into element-based windows, e.g., to consider the last ten elements, or time-based windows, e.g., to consider the last ten seconds of data. There are also different approaches to implementing windows. There are, for example, approaches that use timestamps or time intervals for system-wide windows or buffer-based windows for each single processing step. Sliding-window query processing is also suitable to being implemented in parallel processors by exploiting parallelism between different windows and/or within each window extent. == Query processing == Since there are a lot of prototypes, there is no standardized architecture. However, most DSMS are based on the query processing in DBMS by using declarative languages to express queries, which are translated into a plan of operators. These plans can be optimized and executed. A query processing often consists of the following steps. === Formulation of continuous queries === The formulation of queries is mostly done using declarative languages like SQL in DBMS. Since there are no standardized query languages to express continuous queries, there are a lot of languages and variations. However, most of them are based on SQL, such as the Continuous Query Language (CQL), StreamSQL and ESP. There are also graphical approaches where each processing step is a box and the processing flow is expressed by arrows between the boxes. The language strongly depends on the processing model. For example, if windows are used for the processing, the definition of a window has to be expressed. In StreamSQL, a query with a sliding window for the last 10 elements looks like follows: This stream continuously calculates the average value of "price" of the last 10 tuples, but only considers those tuples whose prices are greater than 100.0. In the next step, the declarative query is translated into a logical query plan. A query plan is a directed graph where the nodes are operators and the edges describe the processing flow. Each operator in the query plan encapsulates the semantic of a specific operation, such as filtering or aggregation. In DSMSs that process relational data streams, the operators are equal or similar to the operators of the Relational algebra, so that there are operators for selection, projection, join, and set operations. This operator concept allows the very flexible and versatile processing of a DSMS. === Optimization of queries === The logical query plan can be optimized, which strongly depends on the streaming model. The basic concepts for optimizing continuous queries are equal to those from database systems. If there are relational data streams and the logical query plan is based on relational operators from the Relational algebra, a query optimizer can use the algebraic equivalences to optimize the plan. These may be, for example, to push selection operators down to the sources, because they are not so computationally intensive like join operators. Furthermore, there are also cost-based optimization techniques like in DBMS, where a query plan with the lowest costs is chosen from different equivalent query plans. One example is to choose the order of two successive join operators. In DBMS this decision is mostly done by certain statistics of the involved databases. But, since the data of a data streams is unknown in advance, there are no such statistics in a DSMS. However, it is possible to observe a data stream for a certain time to obtain some statistics. Using these statistics, the query can also be optimized later. So, in contrast to a DBMS, some DSMS allows to optimize the query even during runtime. Therefore, a DSMS needs some plan migration strategies to replace a running query plan with a new one. === Transformation of queries === Since a logical operator is only responsible for the semantics of an operation but does not consist of any algorithms, the logical query plan must be transformed into an executable counterpart. This is called a physical query plan. The distinction between a logical and a physical operator plan allows more than one implementation for the same logical operator. The join, for example, is logically the same, although it can be implemented by different algorithms like a Nested loop join or a Sort-merge join. Notice, these algorithms also strongly depend on the used stream and processing model. Finally, the query is available as a physical query plan. === Execution of queries === Since the physical query plan consists of executable algorithms, it can be directly executed. For this, the physical query plan is installed into the system. The bottom of the graph (of the query plan) is connected to the incoming sources, which can be everything like connectors to sensors. The top of the graph is connected to the outgoing sinks, which may be for example a visualization. Since most DSMSs are data-driven, a query is executed by pushing the incoming data elements from the source through the query plan to the sink. Each time when a data element passes an operator, the operator performs its specific operation on the data element and forwards the result to all successive operators. == Examples == AURORA, StreamBase Systems, Inc. Archived 23 March 2009 at the Wayback Machine Hortonworks DataFlow IBM Streams NIAGARA Query Engine NiagaraST: A Research Data Stream Management System at Portland State University Odysseus, an open source Java-based framework for Data Stream Management Systems Pipeline DB PIPES Archived 24 December 2016 at the Wayback Machine, webMethods Business Events QStream SAS Event Stream Processing SQLstream STREAM StreamGlobe StreamInsight TelegraphCQ WSO2 Stream Processor
Semantic neural network
Semantic neural network (SNN) is based on John von Neumann's neural network [von Neumann, 1966] and Nikolai Amosov M-Network. There are limitations to a link topology for the von Neumann’s network but SNN accept a case without these limitations. Only logical values can be processed, but SNN accept that fuzzy values can be processed too. All neurons into the von Neumann network are synchronized by tacts. For further use of self-synchronizing circuit technique SNN accepts neurons can be self-running or synchronized. In contrast to the von Neumann network there are no limitations for topology of neurons for semantic networks. It leads to the impossibility of relative addressing of neurons as it was done by von Neumann. In this case an absolute readdressing should be used. Every neuron should have a unique identifier that would provide a direct access to another neuron. Of course, neurons interacting by axons-dendrites should have each other's identifiers. An absolute readdressing can be modulated by using neuron specificity as it was realized for biological neural networks. There’s no description for self-reflectiveness and self-modification abilities into the initial description of semantic networks [Dudar Z.V., Shuklin D.E., 2000]. But in [Shuklin D.E. 2004] a conclusion had been drawn about the necessity of introspection and self-modification abilities in the system. For maintenance of these abilities a concept of pointer to neuron is provided. Pointers represent virtual connections between neurons. In this model, bodies and signals transferring through the neurons connections represent a physical body, and virtual connections between neurons are representing an astral body. It is proposed to create models of artificial neuron networks on the basis of virtual machine supporting the opportunity for paranormal effects. SNN is generally used for natural language processing. == Related models == Computational creativity Semantic hashing Semantic Pointer Architecture Sparse distributed memory
Ordination (statistics)
Ordination or gradient analysis, in multivariate analysis, is a method complementary to data clustering, and used mainly in exploratory data analysis (rather than in hypothesis testing). In contrast to cluster analysis, ordination orders quantities in a (usually lower-dimensional) latent space. In the ordination space, quantities that are near each other share attributes (i.e., are similar to some degree), and dissimilar objects are farther from each other. Such relationships between the objects, on each of several axes or latent variables, are then characterized numerically and/or graphically in a biplot. The first ordination method, principal components analysis, was suggested by Karl Pearson in 1901. == Methods == Ordination methods can broadly be categorized in eigenvector-, algorithm-, or model-based methods. Many classical ordination techniques, including principal components analysis, correspondence analysis (CA) and its derivatives (detrended correspondence analysis, canonical correspondence analysis, and redundancy analysis, belong to the first group). The second group includes some distance-based methods such as non-metric multidimensional scaling, and machine learning methods such as T-distributed stochastic neighbor embedding and nonlinear dimensionality reduction. The third group includes model-based ordination methods, which can be considered as multivariate extensions of Generalized Linear Models. Model-based ordination methods are more flexible in their application than classical ordination methods, so that it is for example possible to include random-effects. Unlike in the aforementioned two groups, there is no (implicit or explicit) distance measure in the ordination. Instead, a distribution needs to be specified for the responses as is typical for statistical models. These and other assumptions, such as the assumed mean-variance relationship, can be validated with the use of residual diagnostics, unlike in other ordination methods. == Applications == Ordination can be used on the analysis of any set of multivariate objects. It is frequently used in several environmental or ecological sciences, particularly plant community ecology. It is also used in genetics and systems biology for microarray data analysis and in psychometrics.
Ho–Kashyap algorithm
The Ho–Kashyap algorithm is an iterative method in machine learning for finding a linear decision boundary that separates two linearly separable classes. It was developed by Yu-Chi Ho and Rangasami L. Kashyap in 1965, and usually presented as a problem in linear programming. == Setup == Given a training set consisting of samples from two classes, the Ho–Kashyap algorithm seeks to find a weight vector w {\displaystyle \mathbf {w} } and a margin vector b {\displaystyle \mathbf {b} } such that: Y w = b {\displaystyle \mathbf {Yw} =\mathbf {b} } where Y {\displaystyle \mathbf {Y} } is the augmented data matrix with samples from both classes (with appropriate sign conventions, e.g., samples from class 2 are negated), w {\displaystyle \mathbf {w} } is the weight vector to be determined, and b {\displaystyle \mathbf {b} } is a positive margin vector. The algorithm minimizes the criterion function: J ( w , b ) = | | Y w − b | | 2 {\displaystyle J(\mathbf {w} ,\mathbf {b} )=||\mathbf {Yw} -\mathbf {b} ||^{2}} subject to the constraint that b > 0 {\displaystyle \mathbf {b} >\mathbf {0} } (element-wise). Given a problem of linearly separating two classes, we consider a dataset of elements { ( x i , y i ) } i ∈ 1 : N {\displaystyle \{(\mathbf {x_{i}} ,y_{i})\}_{i\in 1:N}} where y i ∈ { − 1 , + 1 } {\displaystyle y_{i}\in \{-1,+1\}} . Linearly separating them by a perceptron is equivalent to finding weight and bias w , b {\displaystyle \mathbf {w} ,b} for a perceptron, such that: [ y 1 x 1 1 ⋮ ⋮ y N x N 1 ] [ w b ] > 0 {\displaystyle {\begin{bmatrix}y_{1}\mathbf {x} _{1}&1\\\vdots &\vdots \\y_{N}\mathbf {x} _{N}&1\\\end{bmatrix}}{\begin{bmatrix}\mathbf {w} \\b\end{bmatrix}}>0} == Algorithm == The idea of the Ho–Kashyap algorithm is as follows: Given any b {\displaystyle \mathbf {b} } , the corresponding w {\displaystyle \mathbf {w} } is known: It is simply w = Y + b {\displaystyle \mathbf {w} =\mathbf {Y} ^{+}\mathbf {b} } , where Y + {\displaystyle \mathbf {Y} ^{+}} denotes the Moore–Penrose pseudoinverse of Y {\displaystyle \mathbf {Y} } . Therefore, it only remains to find b {\displaystyle \mathbf {b} } by gradient descent. However, the gradient descent may sometimes decrease some of the coordinates of b {\displaystyle \mathbf {b} } , which may cause some coordinates of b {\displaystyle \mathbf {b} } to become negative, which is undesirable. Therefore, whenever some coordinates of b {\displaystyle \mathbf {b} } would have decreased, those coordinates are unchanged instead. As for the coordinates of b {\displaystyle \mathbf {b} } that would increase, those would increase without issue. Formally, the algorithm is as follows: Initialization: Set b ( 0 ) {\displaystyle \mathbf {b} (0)} to an arbitrary positive vector, typically b ( 0 ) = 1 {\displaystyle \mathbf {b} (0)=\mathbf {1} } (a vector of ones). Set the iteration counter k = 0 {\displaystyle k=0} . Set w ( 0 ) = Y + b ( 0 ) {\displaystyle \mathbf {w} (0)=\mathbf {Y} ^{+}\mathbf {b} (0)} Loop until convergence, or until iteration counter exceeds some k m a x {\displaystyle k_{max}} . Error calculation: Compute the error vector: e ( k ) = Y w ( k ) − b ( k ) {\displaystyle \mathbf {e} (k)=\mathbf {Yw} (k)-\mathbf {b} (k)} . Margin update: Update the margin vector: b ( k + 1 ) = b ( k ) + 2 η k ( e ( k ) + | e ( k ) | ) {\displaystyle \mathbf {b} (k+1)=\mathbf {b} (k)+2\eta _{k}(\mathbf {e} (k)+|\mathbf {e} (k)|)} where η k {\displaystyle \eta _{k}} is a positive learning rate parameter, and | e ( k ) | {\displaystyle |\mathbf {e} (k)|} denotes the element-wise absolute value. Weight calculation: Compute the weight vector using the pseudoinverse: w ( k + 1 ) = Y + b ( k + 1 ) {\displaystyle \mathbf {w} (k+1)=\mathbf {Y} ^{+}\mathbf {b} (k+1)} . Convergence check: If | | e ( k ) | | ≤ θ {\displaystyle ||\mathbf {e} (k)||\leq \theta } for some predetermined threshold θ {\displaystyle \theta } (close to zero), then return b ( k + 1 ) , w ( k + 1 ) {\displaystyle \mathbf {b} (k+1),\mathbf {w} (k+1)} . if e ( k ) ≤ 0 {\displaystyle \mathbf {e} (k)\leq \mathbf {0} } (all components non-positive), return "Samples not separable.". Return "Algorithm failed to converge in time.". == Properties == If the training data is linearly separable, the algorithm converges to a solution (where e ( k ) = 0 {\displaystyle \mathbf {e} (k)=\mathbf {0} } ) in a finite number of iterations. If the data is not linearly separable, the algorithm may or may not ever reach the point where e ( k ) = 0 {\displaystyle \mathbf {e} (k)=\mathbf {0} } . However, if it does happen that e ( k ) ≤ 0 {\displaystyle \mathbf {e} (k)\leq \mathbf {0} } at some iteration, this proves non-separability. The convergence rate depends on the choice of the learning rate parameter ρ {\displaystyle \rho } and the degree of linear separability of the data. == Relationship to other algorithms == Perceptron algorithm: Both seek linear separators. The perceptron updates weights incrementally based on individual misclassified samples, while Ho–Kashyap is a batch method that processes all samples to compute the pseudoinverse and updates based on an overall error vector. Linear discriminant analysis (LDA): LDA assumes underlying Gaussian distributions with equal covariances for the classes and derives the decision boundary from these statistical assumptions. Ho–Kashyap makes no explicit distributional assumptions and instead tries to solve a system of linear inequalities directly. Support vector machines (SVM): For linearly separable data, SVMs aim to find the maximum-margin hyperplane. The Ho–Kashyap algorithm finds a separating hyperplane but not necessarily the one with the maximum margin. If the data is not separable, soft-margin SVMs allow for some misclassifications by optimizing a trade-off between margin size and misclassification penalty, while Ho–Kashyap provides a least-squares solution. == Variants == Modified Ho–Kashyap algorithm changes weight calculation step w ( k + 1 ) = Y + b ( k + 1 ) {\displaystyle \mathbf {w} (k+1)=\mathbf {Y} ^{+}\mathbf {b} (k+1)} to w ( k + 1 ) = w ( k ) + η k Y + | e ( k ) | {\displaystyle \mathbf {w} (k+1)=\mathbf {w} (k)+\eta _{k}\mathbf {Y} ^{+}|\mathbf {e} (k)|} . Kernel Ho–Kashyap algorithm: Applies kernel methods (the "kernel trick") to the Ho–Kashyap framework to enable non-linear classification by implicitly mapping data to a higher-dimensional feature space.
Multifactor dimensionality reduction
Multifactor dimensionality reduction (MDR) is a statistical approach, also used in machine learning automatic approaches, for detecting and characterizing combinations of attributes or independent variables that interact to influence a dependent or class variable. MDR was designed specifically to identify nonadditive interactions among discrete variables that influence a binary outcome and is considered a nonparametric and model-free alternative to traditional statistical methods such as logistic regression. The basis of the MDR method is a constructive induction or feature engineering algorithm that converts two or more variables or attributes to a single attribute. This process of constructing a new attribute changes the representation space of the data. The end goal is to create or discover a representation that facilitates the detection of nonlinear or nonadditive interactions among the attributes such that prediction of the class variable is improved over that of the original representation of the data. == Illustrative example == Consider the following simple example using the exclusive OR (XOR) function. XOR is a logical operator that is commonly used in data mining and machine learning as an example of a function that is not linearly separable. The table below represents a simple dataset where the relationship between the attributes (X1 and X2) and the class variable (Y) is defined by the XOR function such that Y = X1 XOR X2. Table 1 A machine learning algorithm would need to discover or approximate the XOR function in order to accurately predict Y using information about X1 and X2. An alternative strategy would be to first change the representation of the data using constructive induction to facilitate predictive modeling. The MDR algorithm would change the representation of the data (X1 and X2) in the following manner. MDR starts by selecting two attributes. In this simple example, X1 and X2 are selected. Each combination of values for X1 and X2 are examined and the number of times Y=1 and/or Y=0 is counted. In this simple example, Y=1 occurs zero times and Y=0 occurs once for the combination of X1=0 and X2=0. With MDR, the ratio of these counts is computed and compared to a fixed threshold. Here, the ratio of counts is 0/1 which is less than our fixed threshold of 1. Since 0/1 < 1 we encode a new attribute (Z) as a 0. When the ratio is greater than one we encode Z as a 1. This process is repeated for all unique combinations of values for X1 and X2. Table 2 illustrates our new transformation of the data. Table 2 The machine learning algorithm now has much less work to do to find a good predictive function. In fact, in this very simple example, the function Y = Z has a classification accuracy of 1. A nice feature of constructive induction methods such as MDR is the ability to use any data mining or machine learning method to analyze the new representation of the data. Decision trees, neural networks, or a naive Bayes classifier could be used in combination with measures of model quality such as balanced accuracy and mutual information. == Machine learning with MDR == As illustrated above, the basic constructive induction algorithm in MDR is very simple. However, its implementation for mining patterns from real data can be computationally complex. As with any machine learning algorithm there is always concern about overfitting. That is, machine learning algorithms are good at finding patterns in completely random data. It is often difficult to determine whether a reported pattern is an important signal or just chance. One approach is to estimate the generalizability of a model to independent datasets using methods such as cross-validation. Models that describe random data typically don't generalize. Another approach is to generate many random permutations of the data to see what the data mining algorithm finds when given the chance to overfit. Permutation testing makes it possible to generate an empirical p-value for the result. Replication in independent data may also provide evidence for an MDR model but can be sensitive to difference in the data sets. These approaches have all been shown to be useful for choosing and evaluating MDR models. An important step in a machine learning exercise is interpretation. Several approaches have been used with MDR including entropy analysis and pathway analysis. Tips and approaches for using MDR to model gene-gene interactions have been reviewed. == Extensions to MDR == Numerous extensions to MDR have been introduced. These include family-based methods, fuzzy methods, covariate adjustment, odds ratios, risk scores, survival methods, robust methods, methods for quantitative traits, and many others. == Applications of MDR == MDR has mostly been applied to detecting gene-gene interactions or epistasis in genetic studies of common human diseases such as atrial fibrillation, autism, bladder cancer, breast cancer, cardiovascular disease, hypertension, obesity, pancreatic cancer, prostate cancer and tuberculosis. It has also been applied to other biomedical problems such as the genetic analysis of pharmacology outcomes. A central challenge is the scaling of MDR to big data such as that from genome-wide association studies (GWAS). Several approaches have been used. One approach is to filter the features prior to MDR analysis. This can be done using biological knowledge through tools such as BioFilter. It can also be done using computational tools such as ReliefF. Another approach is to use stochastic search algorithms such as genetic programming to explore the search space of feature combinations. Yet another approach is a brute-force search using high-performance computing. == Implementations == www.epistasis.org provides an open-source and freely-available MDR software package. An R package for MDR. An sklearn-compatible Python implementation. An R package for Model-Based MDR. MDR in Weka. Generalized MDR.
Autoscaling
Autoscaling, (also written as auto scaling, auto-scaling, or known as automatic scaling), is a method used in cloud computing that dynamically adjusts the amount of computational resources in a server farm - typically measured by the number of active servers - automatically based on the load on the farm. For example, the number of servers running behind a web application may be increased or decreased automatically based on the number of active users on the site. Since such metrics may change dramatically throughout the course of the day, and servers are a limited resource that cost money to run even while idle, there is often an incentive to run "just enough" servers to support the current load while still being able to support sudden and large spikes in activity. Autoscaling is helpful for such needs, as it can reduce the number of active servers when activity is low, and launch new servers when activity is high. Autoscaling is closely related to, and builds upon, the idea of load balancing. == Advantages == Autoscaling offers the following advantages: For companies running their own web server infrastructure, autoscaling typically means allowing some servers to go to sleep during times of low load, saving on electricity costs (as well as water costs if water is being used to cool the machines). For companies using infrastructure hosted in the cloud, autoscaling can mean lower bills, because most cloud providers charge based on total usage rather than maximum capacity. Even for companies that cannot reduce the total compute capacity they run or pay for at any given time, autoscaling can help by allowing the company to run less time-sensitive workloads on machines that get freed up by autoscaling during times of low traffic. Autoscaling solutions, such as the one offered by Amazon Web Services, can also take care of replacing unhealthy instances and therefore protecting somewhat against hardware, network, and application failures. Autoscaling can offer greater uptime and more availability in cases where production workloads are variable and unpredictable. Autoscaling differs from having a fixed daily, weekly, or yearly cycle of server use in that it is responsive to actual usage patterns, and thus reduces the potential downside of having too few or too many servers for the traffic load. For instance, if traffic is usually lower at midnight, then a static scaling solution might schedule some servers to sleep at night, but this might result in downtime on a night where people happen to use the Internet more (for instance, due to a viral news event). Autoscaling, on the other hand, can handle unexpected traffic spikes better. == Terminology == In the list below, we use the terminology used by Amazon Web Services (AWS). However, alternative names are noted and terminology that is specific to the names of Amazon services is not used for the names. == Practice == === Amazon Web Services (AWS) === Amazon Web Services launched the Amazon Elastic Compute Cloud (EC2) service in August 2006, that allowed developers to programmatically create and terminate instances (machines). At the time of initial launch, AWS did not offer autoscaling, but the ability to programmatically create and terminate instances gave developers the flexibility to write their own code for autoscaling. Third-party autoscaling software for AWS began appearing around April 2008. These included tools by Scalr and RightScale. RightScale was used by Animoto, which was able to handle Facebook traffic by adopting autoscaling. On May 18, 2009, Amazon launched its own autoscaling feature along with Elastic Load Balancing, as part of Amazon Elastic Compute Cloud. Autoscaling is now an integral component of Amazon's EC2 offering. Autoscaling on Amazon Web Services is done through a web browser or the command line tool. In May 2016 Autoscaling was also offered in AWS ECS Service. On-demand video provider Netflix documented their use of autoscaling with Amazon Web Services to meet their highly variable consumer needs. They found that aggressive scaling up and delayed and cautious scaling down served their goals of uptime and responsiveness best. In an article for TechCrunch, Zev Laderman, the co-founder and CEO of Newvem, a service that helps optimize AWS cloud infrastructure, recommended that startups use autoscaling in order to keep their Amazon Web Services costs low. Various best practice guides for AWS use suggest using its autoscaling feature even in cases where the load is not variable. That is because autoscaling offers two other advantages: automatic replacement of any instances that become unhealthy for any reason (such as hardware failure, network failure, or application error), and automatic replacement of spot instances that get interrupted for price or capacity reasons, making it more feasible to use spot instances for production purposes. Netflix's internal best practices require every instance to be in an autoscaling group, and its conformity monkey terminates any instance not in an autoscaling group in order to enforce this best practice. === Microsoft's Windows Azure === On June 27, 2013, Microsoft announced that it was adding autoscaling support to its Windows Azure cloud computing platform. Documentation for the feature is available on the Microsoft Developer Network. === Oracle Cloud === Oracle Cloud Platform allows server instances to automatically scale a cluster in or out by defining an auto-scaling rule. These rules are based on CPU and/or memory utilization and determine when to add or remove nodes. === Google Cloud Platform === On November 17, 2014, the Google Compute Engine announced a public beta of its autoscaling feature for use in Google Cloud Platform applications. As of March 2015, the autoscaling tool is still in Beta. === Facebook === In a blog post in August 2014, a Facebook engineer disclosed that the company had started using autoscaling to bring down its energy costs. The blog post reported a 27% decline in energy use for low traffic hours (around midnight) and a 10-15% decline in energy use over the typical 24-hour cycle. === Kubernetes Horizontal Pod Autoscaler === Kubernetes Horizontal Pod Autoscaler automatically scales the number of pods in a replication controller, deployment or replicaset based on observed CPU utilization (or, with beta support, on some other, application-provided metrics) == Alternative autoscaling decision approaches == Autoscaling by default uses reactive decision approach for dealing with traffic scaling: scaling only happens in response to real-time changes in metrics. In some cases, particularly when the changes occur very quickly, this reactive approach to scaling is insufficient. Two other kinds of autoscaling decision approaches are described below. === Scheduled autoscaling approach === This is an approach to autoscaling where changes are made to the minimum size, maximum size, or desired capacity of the autoscaling group at specific times of day. Scheduled scaling is useful, for instance, if there is a known traffic load increase or decrease at specific times of the day, but the change is too sudden for reactive approach based autoscaling to respond fast enough. AWS autoscaling groups support scheduled scaling. === Predictive autoscaling === This approach to autoscaling uses predictive analytics. The idea is to combine recent usage trends with historical usage data as well as other kinds of data to predict usage in the future, and autoscale based on these predictions. For parts of their infrastructure and specific workloads, Netflix found that Scryer, their predictive analytics engine, gave better results than Amazon's reactive autoscaling approach. In particular, it was better for: Identifying huge spikes in demand in the near future and getting capacity ready a little in advance Dealing with large-scale outages, such as failure of entire availability zones and regions Dealing with variable traffic patterns, providing more flexibility on the rate of scaling out or in based on the typical level and rate of change in demand at various times of day On November 20, 2018, AWS announced that predictive scaling would be available as part of its autoscaling offering.
Teacher forcing
Teacher forcing is an algorithm for training the weights of recurrent neural networks (RNNs). It involves feeding observed sequence values (i.e. ground-truth samples) back into the RNN after each step, thus forcing the RNN to stay close to the ground-truth sequence. The term "teacher forcing" can be motivated by comparing the RNN to a human student taking a multi-part exam where the answer to each part (for example a mathematical calculation) depends on the answer to the preceding part. In this analogy, rather than grading every answer in the end, with the risk that the student fails every single part even though they only made a mistake in the first one, a teacher records the score for each individual part and then tells the student the correct answer, to be used in the next part. The use of an external teacher signal is in contrast to real-time recurrent learning (RTRL). Teacher signals are known from oscillator networks. The promise is, that teacher forcing helps to reduce the training time. The term "teacher forcing" was introduced in 1989 by Ronald J. Williams and David Zipser, who reported that the technique was already being "frequently used in dynamical supervised learning tasks" around that time. A NeurIPS 2016 paper introduced the related method of "professor forcing".