Neural scaling law

Neural scaling law

In machine learning, a neural scaling law is an empirical scaling law that describes how neural network performance changes as key factors are scaled up or down. These factors typically include the number of parameters, training dataset size, and training cost. Some models also exhibit performance gains by scaling inference through increased test-time compute (TTC), extending neural scaling laws beyond training to the deployment phase. == Introduction == In general, a deep learning model can be characterized by four parameters: model size, training dataset size, training cost, and the post-training error rate (e.g., the test set error rate). Each of these variables can be defined as a real number, usually written as N , D , C , L {\displaystyle N,D,C,L} (respectively: parameter count, dataset size, computing cost, and loss). A neural scaling law is a theoretical or empirical statistical law between these parameters. There are also other parameters with other scaling laws. === Size of the model === In most cases, the model's size is simply the number of parameters. However, one complication arises with the use of sparse models, such as mixture-of-expert models. With sparse models, during inference, only a fraction of their parameters are used. In comparison, most other kinds of neural networks, such as transformer models, always use all their parameters during inference. === Size of the training dataset === The size of the training dataset is usually quantified by the number of data points within it. Larger training datasets are typically preferred, as they provide a richer and more diverse source of information from which the model can learn. This can lead to improved generalization performance when the model is applied to new, unseen data. However, increasing the size of the training dataset also increases the computational resources and time required for model training. With the "pretrain, then finetune" method used for most large language models, there are two kinds of training dataset: the pretraining dataset and the finetuning dataset. Their sizes have different effects on model performance. Generally, the finetuning dataset is less than 1% the size of pretraining dataset. In some cases, a small amount of high quality data suffices for finetuning, and more data does not necessarily improve performance. Many scaling laws, due to their inherent diminishing returns nature, value data based on a submodular set function which was shown in a paper on this topic. === Cost of training === Training cost is typically measured in terms of time (how long it takes to train the model) and computational resources (how much processing power and memory are required). It is important to note that the cost of training can be significantly reduced with efficient training algorithms, optimized software libraries, and parallel computing on specialized hardware such as GPUs or TPUs. The cost of training a neural network model is a function of several factors, including model size, training dataset size, the training algorithm complexity, and the computational resources available. In particular, doubling the training dataset size does not necessarily double the cost of training, because one may train the model for several times over the same dataset (each being an "epoch"). === Performance === The performance of a neural network model is evaluated based on its ability to accurately predict the output given some input data. Common metrics for evaluating model performance include: Negative log-likelihood per token (logarithm of perplexity) for language modeling; Accuracy, precision, recall, and F1 score for classification tasks; Mean squared error (MSE) or mean absolute error (MAE) for regression tasks; Elo rating in a competition against other models, such as gameplay or preference by a human judge. Performance can be improved by using more data, larger models, different training algorithms, regularizing the model to prevent overfitting, and early stopping using a validation set. When the performance is a number bounded within the range of [ 0 , 1 ] {\displaystyle [0,1]} , such as accuracy, precision, etc., it often scales as a sigmoid function of cost, as seen in the figures. == Examples == === (Hestness, Narang, et al, 2017) === The 2017 paper is a common reference point for neural scaling laws fitted by statistical analysis on experimental data. Previous works before the 2000s, as cited in the paper, were either theoretical or orders of magnitude smaller in scale. Whereas previous works generally found the scaling exponent to scale like L ∝ D − α {\displaystyle L\propto D^{-\alpha }} , with α ∈ { 0.5 , 1 , 2 } {\displaystyle \alpha \in \{0.5,1,2\}} , the paper found that α ∈ [ 0.07 , 0.35 ] {\displaystyle \alpha \in [0.07,0.35]} . Of the factors they varied, only task can change the exponent α {\displaystyle \alpha } . Changing the architecture optimizers, regularizers, and loss functions, would only change the proportionality factor, not the exponent. For example, for the same task, one architecture might have L = 1000 D − 0.3 {\displaystyle L=1000D^{-0.3}} while another might have L = 500 D − 0.3 {\displaystyle L=500D^{-0.3}} . They also found that for a given architecture, the number of parameters necessary to reach lowest levels of loss, given a fixed dataset size, grows like N ∝ D β {\displaystyle N\propto D^{\beta }} for another exponent β {\displaystyle \beta } . They studied machine translation with LSTM ( α ∼ 0.13 {\displaystyle \alpha \sim 0.13} ), generative language modelling with LSTM ( α ∈ [ 0.06 , 0.09 ] , β ≈ 0.7 {\displaystyle \alpha \in [0.06,0.09],\beta \approx 0.7} ), ImageNet classification with ResNet ( α ∈ [ 0.3 , 0.5 ] , β ≈ 0.6 {\displaystyle \alpha \in [0.3,0.5],\beta \approx 0.6} ), and speech recognition with two hybrid (LSTMs complemented by either CNNs or an attention decoder) architectures ( α ≈ 0.3 {\displaystyle \alpha \approx 0.3} ). === (Henighan, Kaplan, et al, 2020) === A 2020 analysis studied statistical relations between C , N , D , L {\displaystyle C,N,D,L} over a wide range of values and found similar scaling laws, over the range of N ∈ [ 10 3 , 10 9 ] {\displaystyle N\in [10^{3},10^{9}]} , C ∈ [ 10 12 , 10 21 ] {\displaystyle C\in [10^{12},10^{21}]} , and over multiple modalities (text, video, image, text to image, etc.). In particular, the scaling laws it found are (Table 1 of ): For each modality, they fixed one of the two C , N {\displaystyle C,N} , and varying the other one ( D {\displaystyle D} is varied along using D = C / 6 N {\displaystyle D=C/6N} ), the achievable test loss satisfies L = L 0 + ( x 0 x ) α {\displaystyle L=L_{0}+\left({\frac {x_{0}}{x}}\right)^{\alpha }} where x {\displaystyle x} is the varied variable, and L 0 , x 0 , α {\displaystyle L_{0},x_{0},\alpha } are parameters to be found by statistical fitting. The parameter α {\displaystyle \alpha } is the most important one. When N {\displaystyle N} is the varied variable, α {\displaystyle \alpha } ranges from 0.037 {\displaystyle 0.037} to 0.24 {\displaystyle 0.24} depending on the model modality. This corresponds to the α = 0.34 {\displaystyle \alpha =0.34} from the Chinchilla scaling paper. When C {\displaystyle C} is the varied variable, α {\displaystyle \alpha } ranges from 0.048 {\displaystyle 0.048} to 0.19 {\displaystyle 0.19} depending on the model modality. This corresponds to the β = 0.28 {\displaystyle \beta =0.28} from the Chinchilla scaling paper. Given fixed computing budget, optimal model parameter count is consistently around N o p t ( C ) = ( C 5 × 10 − 12 petaFLOP-day ) 0.7 = 9.0 × 10 − 7 C 0.7 {\displaystyle N_{opt}(C)=\left({\frac {C}{5\times 10^{-12}{\text{petaFLOP-day}}}}\right)^{0.7}=9.0\times 10^{-7}C^{0.7}} The parameter 9.0 × 10 − 7 {\displaystyle 9.0\times 10^{-7}} varies by a factor of up to 10 for different modalities. The exponent parameter 0.7 {\displaystyle 0.7} varies from 0.64 {\displaystyle 0.64} to 0.75 {\displaystyle 0.75} for different modalities. This exponent corresponds to the ≈ 0.5 {\displaystyle \approx 0.5} from the Chinchilla scaling paper. It's "strongly suggested" (but not statistically checked) that D o p t ( C ) ∝ N o p t ( C ) 0.4 ∝ C 0.28 {\displaystyle D_{opt}(C)\propto N_{opt}(C)^{0.4}\propto C^{0.28}} . This exponent corresponds to the ≈ 0.5 {\displaystyle \approx 0.5} from the Chinchilla scaling paper. The scaling law of L = L 0 + ( C 0 / C ) 0.048 {\displaystyle L=L_{0}+(C_{0}/C)^{0.048}} was confirmed during the training of GPT-3 (Figure 3.1 ). === Chinchilla scaling (Hoffmann, et al, 2022) === One particular scaling law ("Chinchilla scaling") states that, for a large language model (LLM) autoregressively trained for one epoch, with a cosine learning rate schedule, we have: { C = C 0 N D L = A N α + B D β + L 0 {\displaystyle {\begin{cases}C=C_{0}ND\\L={\frac {A}{N^{\alpha }}}+{\frac {B}{D^{\beta }}}+L_{0}\end{cases}}} where the variables are C {\displaystyle C} is the cost o

Spherical basis

In pure and applied mathematics, particularly quantum mechanics and computer graphics and their applications, a spherical basis is the basis used to express spherical tensors. The spherical basis closely relates to the description of angular momentum in quantum mechanics and spherical harmonic functions. While spherical polar coordinates are one orthogonal coordinate system for expressing vectors and tensors using polar and azimuthal angles and radial distance, the spherical basis are constructed from the standard basis and use complex numbers. == In three dimensions == A vector A in 3D Euclidean space R3 can be expressed in the familiar Cartesian coordinate system in the standard basis ex, ey, ez, and coordinates Ax, Ay, Az: or any other coordinate system with associated basis set of vectors. From this extend the scalars to allow multiplication by complex numbers, so that we are now working in C 3 {\displaystyle \mathbb {C} ^{3}} rather than R 3 {\displaystyle \mathbb {R} ^{3}} . === Basis definition === In the spherical bases denoted e+, e−, e0, and associated coordinates with respect to this basis, denoted A+, A−, A0, the vector A is: where the spherical basis vectors can be defined in terms of the Cartesian basis using complex-valued coefficients in the xy plane: in which i {\displaystyle i} denotes the imaginary unit, and one normal to the plane in the z direction: e 0 = e z {\displaystyle \mathbf {e} _{0}=\mathbf {e} _{z}} The inverse relations are: === Commutator definition === While giving a basis in a 3-dimensional space is a valid definition for a spherical tensor, it only covers the case for when the rank k {\displaystyle k} is 1. For higher ranks, one may use either the commutator, or rotation definition of a spherical tensor. The commutator definition is given below, any operator T q ( k ) {\displaystyle T_{q}^{(k)}} that satisfies the following relations is a spherical tensor: [ J ± , T q ( k ) ] = ℏ ( k ∓ q ) ( k ± q + 1 ) T q ± 1 ( k ) {\displaystyle [J_{\pm },T_{q}^{(k)}]=\hbar {\sqrt {(k\mp q)(k\pm q+1)}}T_{q\pm 1}^{(k)}} [ J z , T q ( k ) ] = ℏ q T q ( k ) {\displaystyle [J_{z},T_{q}^{(k)}]=\hbar qT_{q}^{(k)}} === Rotation definition === Analogously to how the spherical harmonics transform under a rotation, a general spherical tensor transforms as follows, when the states transform under the unitary Wigner D-matrix D ( R ) {\displaystyle {\mathcal {D}}(R)} , where R is a (3×3 rotation) group element in SO(3). That is, these matrices represent the rotation group elements. With the help of its Lie algebra, one can show these two definitions are equivalent. D ( R ) T q ( k ) D † ( R ) = ∑ q ′ = − k k T q ′ ( k ) D q ′ q ( k ) {\displaystyle {\mathcal {D}}(R)T_{q}^{(k)}{\mathcal {D}}^{\dagger }(R)=\sum _{q'=-k}^{k}T_{q'}^{(k)}{\mathcal {D}}_{q'q}^{(k)}} === Coordinate vectors === For the spherical basis, the coordinates are complex-valued numbers A+, A0, A−, and can be found by substitution of (3B) into (1), or directly calculated from the inner product ⟨, ⟩ (5): A 0 = ⟨ e 0 , A ⟩ = ⟨ e z , A ⟩ = A z {\displaystyle A_{0}=\left\langle \mathbf {e} _{0},\mathbf {A} \right\rangle =\left\langle \mathbf {e} _{z},\mathbf {A} \right\rangle =A_{z}} with inverse relations: In general, for two vectors with complex coefficients in the same real-valued orthonormal basis ei, with the property ei·ej = δij, the inner product is: where · is the usual dot product and the complex conjugate must be used to keep the magnitude (or "norm") of the vector positive definite. == Properties (three dimensions) == === Orthonormality === The spherical basis is an orthonormal basis, since the inner product ⟨, ⟩ (5) of every pair vanishes meaning the basis vectors are all mutually orthogonal: ⟨ e + , e − ⟩ = ⟨ e − , e 0 ⟩ = ⟨ e 0 , e + ⟩ = 0 {\displaystyle \left\langle \mathbf {e} _{+},\mathbf {e} _{-}\right\rangle =\left\langle \mathbf {e} _{-},\mathbf {e} _{0}\right\rangle =\left\langle \mathbf {e} _{0},\mathbf {e} _{+}\right\rangle =0} and each basis vector is a unit vector: ⟨ e + , e + ⟩ = ⟨ e − , e − ⟩ = ⟨ e 0 , e 0 ⟩ = 1 {\displaystyle \left\langle \mathbf {e} _{+},\mathbf {e} _{+}\right\rangle =\left\langle \mathbf {e} _{-},\mathbf {e} _{-}\right\rangle =\left\langle \mathbf {e} _{0},\mathbf {e} _{0}\right\rangle =1} hence the need for the normalizing factors of 1 / 2 {\displaystyle 1/\!{\sqrt {2}}} . === Change of basis matrix === The defining relations (3A) can be summarized by a transformation matrix U: ( e + e − e 0 ) = U ( e x e y e z ) , U = ( − 1 2 − i 2 0 + 1 2 − i 2 0 0 0 1 ) , {\displaystyle {\begin{pmatrix}\mathbf {e} _{+}\\\mathbf {e} _{-}\\\mathbf {e} _{0}\end{pmatrix}}=\mathbf {U} {\begin{pmatrix}\mathbf {e} _{x}\\\mathbf {e} _{y}\\\mathbf {e} _{z}\end{pmatrix}}\,,\quad \mathbf {U} ={\begin{pmatrix}-{\frac {1}{\sqrt {2}}}&-{\frac {i}{\sqrt {2}}}&0\\+{\frac {1}{\sqrt {2}}}&-{\frac {i}{\sqrt {2}}}&0\\0&0&1\end{pmatrix}}\,,} with inverse: ( e x e y e z ) = U − 1 ( e + e − e 0 ) , U − 1 = ( − 1 2 + 1 2 0 + i 2 + i 2 0 0 0 1 ) . {\displaystyle {\begin{pmatrix}\mathbf {e} _{x}\\\mathbf {e} _{y}\\\mathbf {e} _{z}\end{pmatrix}}=\mathbf {U} ^{-1}{\begin{pmatrix}\mathbf {e} _{+}\\\mathbf {e} _{-}\\\mathbf {e} _{0}\end{pmatrix}}\,,\quad \mathbf {U} ^{-1}={\begin{pmatrix}-{\frac {1}{\sqrt {2}}}&+{\frac {1}{\sqrt {2}}}&0\\+{\frac {i}{\sqrt {2}}}&+{\frac {i}{\sqrt {2}}}&0\\0&0&1\end{pmatrix}}\,.} It can be seen that U is a unitary matrix, in other words its Hermitian conjugate U† (complex conjugate and matrix transpose) is also the inverse matrix U−1. For the coordinates: ( A + A − A 0 ) = U ∗ ( A x A y A z ) , U ∗ = ( − 1 2 + i 2 0 + 1 2 + i 2 0 0 0 1 ) , {\displaystyle {\begin{pmatrix}A_{+}\\A_{-}\\A_{0}\end{pmatrix}}=\mathbf {U} ^{\mathrm {} }{\begin{pmatrix}A_{x}\\A_{y}\\A_{z}\end{pmatrix}}\,,\quad \mathbf {U} ^{\mathrm {} }={\begin{pmatrix}-{\frac {1}{\sqrt {2}}}&+{\frac {i}{\sqrt {2}}}&0\\+{\frac {1}{\sqrt {2}}}&+{\frac {i}{\sqrt {2}}}&0\\0&0&1\end{pmatrix}}\,,} and inverse: ( A x A y A z ) = ( U ∗ ) − 1 ( A + A − A 0 ) , ( U ∗ ) − 1 = ( − 1 2 + 1 2 0 − i 2 − i 2 0 0 0 1 ) . {\displaystyle {\begin{pmatrix}A_{x}\\A_{y}\\A_{z}\end{pmatrix}}=(\mathbf {U} ^{\mathrm {} })^{-1}{\begin{pmatrix}A_{+}\\A_{-}\\A_{0}\end{pmatrix}}\,,\quad (\mathbf {U} ^{\mathrm {} })^{-1}={\begin{pmatrix}-{\frac {1}{\sqrt {2}}}&+{\frac {1}{\sqrt {2}}}&0\\-{\frac {i}{\sqrt {2}}}&-{\frac {i}{\sqrt {2}}}&0\\0&0&1\end{pmatrix}}\,.} === Cross products === Taking cross products of the spherical basis vectors, we find an obvious relation: e q × e q = 0 {\displaystyle \mathbf {e} _{q}\times \mathbf {e} _{q}={\boldsymbol {0}}} where q is a placeholder for +, −, 0, and two less obvious relations: e ± × e ∓ = ± i e 0 {\displaystyle \mathbf {e} _{\pm }\times \mathbf {e} _{\mp }=\pm i\mathbf {e} _{0}} e ± × e 0 = ± i e ± {\displaystyle \mathbf {e} _{\pm }\times \mathbf {e} _{0}=\pm i\mathbf {e} _{\pm }} === Inner product in the spherical basis === The inner product between two vectors A and B in the spherical basis follows from the above definition of the inner product: ⟨ A , B ⟩ = A + B + ⋆ + A − B − ⋆ + A 0 B 0 ⋆ {\displaystyle \left\langle \mathbf {A} ,\mathbf {B} \right\rangle =A_{+}B_{+}^{\star }+A_{-}B_{-}^{\star }+A_{0}B_{0}^{\star }}

Spleak

Spleak was an IM platform where users could publish and rate content. It existed in the form of six bots covering as many subject areas: CelebSpleak, SportSpleak, VoteSpleak, TVSpleak, GameSpleak, and StyleSpleak. == Overview == Users can add a "multi-Spleak" (which contains all of the different Spleak bots in one) or add the separate bots to their IM buddy lists on MSN and AIM. Users are also allowed access to Spleak online by using a CelebSpleak, SportSpleak, or VoteSpleak widget, or through the CelebSpleak and SportSpleak applications with Facebook. Spleak was an alternate reality game and is moving to its own company, Spleak Media Network. "Celebrate Spleak" was introduced throughout 2007, launched in 2008, and was forced to retire in 2009. == Key people == Spleak was co-founded by Morten Lund and Nicolaj Reffstrup. The company's chief executive officer is Morrie Eisenburg; Josh Scott is Vice President in Product and Tyler Wells is Vice President in Engineering.

Arabic Ontology

Arabic Ontology is a website offering linguistic ontology services for the Arabic language which can be used like the online site WordNet. Users can use Arabic Ontology to classify or clarify the concepts and meanings of Arabic terms. == Ontology Structure == The ontology structure (i.e., data model) is similar to WordNet's structure. Each concept in the database is given a unique concept identifier (URI), informally described by a gloss, and lexicalized by one or more synonymous lemma terms. Each term-concept pair is called a sense, and is given a SenseID. A set of senses is called synset. Concepts and senses are described by further attributes such as era and area — to specify example usage and ontological analysis. Semantic relations are defined between concepts. Some important entities are included in the ontology, such as individual countries and bodies of water. These individuals are given separate IndividualIDs and linked with their concepts through the InstanceOf relation. == Mappings to other resources == Concepts in the Arabic Ontology are mapped to synsets in WordNet, as well as to BFO and DOLCE. Terms used in the Arabic Ontology are mapped to lemmas in the LDC's SAMA database. == Applications == Arabic Ontology can be used in many application domains, such as: Information retrieval, to enrich queries (e.g., in search engines) and improve the quality of the results, i.e. meaningful search rather than string-matching search; Machine translation and word-sense disambiguation, by finding the exact mapping of concepts across languages, especially that the Arabic ontology is also mapped to the WordNet; Data Integration and interoperability in which the Arabic ontology can be used as a semantic reference to link databases and information systems; Semantic Web and Web 3.0, by using the Arabic ontology as a semantic reference to disambiguate the meanings used in websites; among many other applications. == URLs Design == The URLs in the Arabic Ontology are designed according to the W3C's Best Practices for Publishing Linked Data, as described in the following URL schemes. This allows one to also explore the whole database like exploring a graph: Ontology Concept: Each concept in the Arabic Ontology has a ConceptID and can be accessed using: https://{domain}/concept/{ConceptID | Term}. In case of a term, the set of concepts that this term lexicalizes are all retrieved. In case of a ConceptID, the concept and its direct subtypes are retrieved, e.g. https://ontology.birzeit.edu/concept/293198 Semantic relations: Relationships between concepts can be accessed using these schemes: (i) the URL: https:// {domain}/concept/{RelationName}/{ConceptID} allows retrieval of relationships among ontology concepts. (ii) the URL: https://{domain}/lexicalconcept/{RelationName}/{lexicalConceptID} allows retrieval of relations between lexical concepts. For example, https://ontology.birzeit.edu/concept/instances/293121 retrieves the instances of the concept 293121. The relations that are currently used in our database are: {subtypes, type, instances, parts, related, similar, equivalent}.

You Only Look Once

You Only Look Once (YOLO) is a series of real-time object detection systems based on convolutional neural networks. First introduced by Joseph Redmon et al. in 2015, YOLO has undergone several iterations and improvements, becoming one of the most popular object detection frameworks. The name "You Only Look Once" refers to the fact that the algorithm requires only one forward propagation pass through the neural network to make predictions, unlike previous region proposal-based techniques like R-CNN that require thousands for a single image. == Overview == Compared to previous methods like R-CNN and OverFeat, instead of applying the model to an image at multiple locations and scales, YOLO applies a single neural network to the full image. This network divides the image into regions and predicts bounding boxes and probabilities for each region. These bounding boxes are weighted by the predicted probabilities. === OverFeat === OverFeat was an early influential model for simultaneous object classification and localization. Its architecture is as follows: Train a neural network for image classification only ("classification-trained network"). This could be one like the AlexNet. The last layer of the trained network is removed, and for every possible object class, initialize a network module at the last layer ("regression network"). The base network has its parameters frozen. The regression network is trained to predict the ( x , y ) {\displaystyle (x,y)} coordinates of two corners of the object's bounding box. During inference time, the classification-trained network is run over the same image over many different zoom levels and croppings. For each, it outputs a class label and a probability for that class label. Each output is then processed by the regression network of the corresponding class. This results in thousands of bounding boxes with class labels and probability. These boxes are merged until only one single box with a single class label remains. == Versions == There are two parts to the YOLO series. The original part contained YOLOv1, v2, and v3, all released on a website maintained by Joseph Redmon. === YOLOv1 === The original YOLO algorithm, introduced in 2015, divides the image into an S × S {\displaystyle S\times S} grid of cells. If the center of an object's bounding box falls into a grid cell, that cell is said to "contain" that object. Each grid cell predicts B bounding boxes and confidence scores for those boxes. These confidence scores reflect how confident the model is that the box contains an object and how accurate it thinks the box is that it predicts. In more detail, the network performs the same convolutional operation over each of the S 2 {\displaystyle S^{2}} patches. The output of the network on each patch is a tuple as follows: ( p 1 , … , p C , c 1 , x 1 , y 1 , w 1 , h 1 , … , c B , x B , y B , w B , h B ) {\displaystyle (p_{1},\dots ,p_{C},c_{1},x_{1},y_{1},w_{1},h_{1},\dots ,c_{B},x_{B},y_{B},w_{B},h_{B})} where p i {\displaystyle p_{i}} is the conditional probability that the cell contains an object of class i {\displaystyle i} , conditional on the cell containing at least one object. x j , y j , w j , h j {\displaystyle x_{j},y_{j},w_{j},h_{j}} are the center coordinates, width, and height of the j {\displaystyle j} -th predicted bounding box that is centered in the cell. Multiple bounding boxes are predicted to allow each prediction to specialize in one kind of bounding box. For example, slender objects might be predicted by j = 2 {\displaystyle j=2} while stout objects might be predicted by j = 1 {\displaystyle j=1} . c j {\displaystyle c_{j}} is the predicted intersection over union (IoU) of each bounding box with its corresponding ground truth. The network architecture has 24 convolutional layers followed by 2 fully connected layers. During training, for each cell, if it contains a ground truth bounding box, then only the predicted bounding boxes with the highest IoU with the ground truth bounding boxes is used for gradient descent. Concretely, let j {\displaystyle j} be that predicted bounding box, and let i {\displaystyle i} be the ground truth class label, then x j , y j , w j , h j {\displaystyle x_{j},y_{j},w_{j},h_{j}} are trained by gradient descent to approach the ground truth, p i {\displaystyle p_{i}} is trained towards 1 {\displaystyle 1} , other p i ′ {\displaystyle p_{i'}} are trained towards zero. If a cell contains no ground truth, then only c 1 , c 2 , … , c B {\displaystyle c_{1},c_{2},\dots ,c_{B}} are trained by gradient descent to approach zero. === YOLOv2 === Released in 2016, YOLOv2 (also known as YOLO9000) improved upon the original model by incorporating batch normalization, a higher resolution classifier, and using anchor boxes to predict bounding boxes. It could detect over 9000 object categories. It was also released on GitHub under the Apache 2.0 license. === YOLOv3 === YOLOv3, introduced in 2018, contained only "incremental" improvements, including the use of a more complex backbone network, multiple scales for detection, and a more sophisticated loss function. === YOLOv4 and beyond === Subsequent versions of YOLO (v4, v5, etc.) have been developed by different researchers, further improving performance and introducing new features. These versions are not officially associated with the original YOLO authors but build upon their work. As of 2026, versions up to YOLO26 have been released..

Text Database and Dictionary of Classic Mayan

The project Text Database and Dictionary of Classic Mayan (abbr. TWKM) promotes research on the writing and language of pre-Hispanic Maya culture. It is housed in the Faculty of Arts at the University of Bonn and was established with funding from the North Rhine-Westphalian Academy of Sciences, Humanities and the Arts. The project has a projected run-time of fifteen years and is directed by Nikolai Grube from the Department of Anthropology of the Americas at the University of Bonn. The goal of the project is to conduct computer-based studies of all extant Maya hieroglyphic texts from an epigraphic and cultural-historical standpoint, and to produce and publish a database and a comprehensive dictionary of the Classic Mayan language. == Subject of the Project == The text database, as well as the dictionary that will be compiled by the conclusion of the project, will be assembled based on all known texts from the pre-Hispanic Maya culture. These texts were produced and used between approximately the third century B.C. through A.D. 1500, in a region that today includes parts of the countries of Mexico, Guatemala, Belize, and Honduras. The thousands of hieroglyphic inscriptions on monuments, ceramics, or daily objects that have survived into the present offer insight into the language's vocabulary and structure. The project's database and dictionary will digitally represent original spellings using the logo-syllabic Maya hieroglyphs, as well as their transcription and transliteration in the Roman alphabet. The data will be additionally annotated with various epigraphic analyses, translations, and further object-specific information. == Project Partners == TWKM will employ digital technologies in order to compile and make available the data and metadata, as well as to publish the project's research results. The project thereby methodologically positions itself in the field of the digital humanities. The project will be conducted in cooperation with the project partners (below), the research association for the eHumanities TextGrid, as well as the University and Regional Library of Bonn (ULB). The working environment that is currently under construction, in which the data and metadata will be compiled and annotated, will be realized in theTextGrid Laboratory, a software of the virtual research environment. A further component of this software, the TextGrid Repository, will make the data that are authorized for publication freely available online and ensure their long-term storage. The tools for data compilation and annotation attained from the modularly constructed and extended TextGrid lab thereby provide all the necessary materials for facilitating the research team's the typical epigraphic workflow. The workflow usually begins by documenting the texts and the objects on which they are preserved, and by compiling descriptive data. It then continues with the various levels of epigraphic and linguistic analysis, and concludes in the best case scenario with a translation of the analyzed inscription and a corresponding publication. In cooperation with the ULB, selected data will additionally be made available. The project's Virtual Inscription Archive will present online, in the Digital Collections of the ULB, hieroglyphic inscriptions selected from the published data in the repository, including an image of and brief information about the texts and the objects on which they are written, epigraphic analysis, and translation. == Project Goal == One of the project's goals is to produce a dictionary of Classic Mayan, in both digital and print form, towards the end of the project run-time. Additionally, a database with a corpus of inscriptions, including their translations and epigraphic analyses, will be made freely available online. The database furthermore will provide an ontology-like link of the contextual object data with the inscriptions and with each other, thereby allowing a cultural-historical arrangement of all contents within the periods of pre-Hispanic Maya culture. The contents of the database are additionally linked to citations of relevant literature. As a result, the database will also make freely available to both the scientific community and other interested parties a bibliography representing the research history and a base of knowledge concerning ancient Maya culture and script. In addition, the Classic Maya script, in its temporally defined stages of language development, will be gathered into and documented in a comprehensive language corpus with the aid of the information gathered by the project. In collaboration with all project participants, the corpus data can be used, together with the aid of various comparable analyses and also computational linguistic methods, such as inference-based methods, to confirm readings of some hieroglyphs that are currently only partially confirmed, and to eventually completely decipher the Classic Maya script.

Automatic acquisition of sense-tagged corpora

The knowledge acquisition bottleneck is perhaps the major impediment to solving the word-sense disambiguation (WSD) problem. Unsupervised learning methods rely on knowledge about word senses, which is barely formulated in dictionaries and lexical databases. Supervised learning methods depend heavily on the existence of manually annotated examples for every word sense, a requisite that can so far be met only for a handful of words for testing purposes, as it is done in the Senseval exercises. == Existing methods == Therefore, one of the most promising trends in WSD research is using the largest corpus ever accessible, the World Wide Web, to acquire lexical information automatically. WSD has been traditionally understood as an intermediate language engineering technology which could improve applications such as information retrieval (IR). In this case, however, the reverse is also true: Web search engines implement simple and robust IR techniques that can be successfully used when mining the Web for information to be employed in WSD. The most direct way of using the Web (and other corpora) to enhance WSD performance is the automatic acquisition of sense-tagged corpora, the fundamental resource to feed supervised WSD algorithms. Although this is far from being commonplace in the WSD literature, a number of different and effective strategies to achieve this goal have already been proposed. Some of these strategies are: acquisition by direct Web searching (searches for monosemous synonyms, hypernyms, hyponyms, parsed gloss' words, etc.), Yarowsky algorithm (bootstrapping), acquisition via Web directories, and acquisition via cross-language meaning evidences. == Summary == === Optimistic results === The automatic extraction of examples to train supervised learning algorithms reviewed has been, by far, the best explored approach to mine the web for word-sense disambiguation. Some results are certainly encouraging: In some experiments, the quality of the Web data for WSD equals that of human-tagged examples. This is the case of the monosemous relatives plus bootstrapping with Semcor seeds technique and the examples taken from the ODP Web directories. In the first case, however, Semcor-size example seeds are necessary (and only available for English), and it has only been tested with a very limited set of nouns; in the second case, the coverage is quite limited, and it is not yet clear whether it can be grown without compromising the quality of the examples retrieved. It has been shown that a mainstream supervised learning technique trained exclusively with web data can obtain better results than all unsupervised WSD systems which participated at Senseval-2. Web examples made a significant contribution to the best Senseval-2 English all-words system. === Difficulties === There are, however, several open research issues related to the use of Web examples in WSD: High precision in the retrieved examples (i.e., correct sense assignments for the examples) does not necessarily lead to good supervised WSD results (i.e., the examples are possibly not useful for training). The most complete evaluation of Web examples for supervised WSD indicates that learning with Web data improves over unsupervised techniques, but the results are nevertheless far from those obtained with hand-tagged data, and do not even beat the most-frequent-sense baseline. Results are not always reproducible; the same or similar techniques may lead to different results in different experiments. Compare, for instance, Mihalcea (2002) with Agirre and Martínez (2004), or Agirre and Martínez (2000) with Mihalcea and Moldovan (1999). Results with Web data seem to be very sensitive to small differences in the learning algorithm, to when the corpus was extracted (search engines change continuously), and on small heuristic issues (e.g., differences in filters to discard part of the retrieved examples). Results are strongly dependent on bias (i.e., on the relative frequencies of examples per word sense). It is unclear whether this is simply a problem of Web data, or an intrinsic problem of supervised learning techniques, or just a problem of how WSD systems are evaluated (indeed, testing with rather small Senseval data may overemphasize sense distributions compared to sense distributions obtained from the full Web as corpus). In any case, Web data has an intrinsic bias, because queries to search engines directly constrain the context of the examples retrieved. There are approaches that alleviate this problem, such as using several different seeds/queries per sense or assigning senses to Web directories and then scanning directories for examples; but this problem is nevertheless far from being solved. Once a Web corpus of examples is built, it is not entirely clear whether its distribution is safe from a legal perspective. === Future === Besides automatic acquisition of examples from the Web, there are some other WSD experiments that have profited from the Web: The Web as a social network has been successfully used for cooperative annotation of a corpus (OMWE, Open Mind Word Expert project), which has already been used in three Senseval-3 tasks (English, Romanian and Multilingual). The Web has been used to enrich WordNet senses with domain information: topic signatures and Web directories, which have in turn been successfully used for WSD. Also, some research benefited from the semantic information that the Wikipedia maintains on its disambiguation pages. It is clear, however, that most research opportunities remain largely unexplored. For instance, little is known about how to use lexical information extracted from the Web in knowledge-based WSD systems; and it is also hard to find systems that use Web-mined parallel corpora for WSD, even though there are already efficient algorithms that use parallel corpora in WSD.