Pinakes

Pinakes

The Pinakes (Ancient Greek: Πίνακες 'tables', plural of πίναξ pinax) is a lost bibliographic work composed by Callimachus (310/305–240 BCE) that is popularly considered to be the first library catalog in the West; its contents were based upon the holdings of the Library of Alexandria during Callimachus's tenure there during the third century BCE. == History == The Library of Alexandria had been founded by Ptolemy I Soter about 306 BCE. The first recorded librarian was Zenodotus of Ephesus. During Zenodotus' tenure, Callimachus, who was never the head librarian, compiled many catalogues/lists, each called Pinakes. His most famous one listed authors and their works; thus he became the first known bibliographer and the scholar who organized the library by authors and subjects about 245 BCE. His work was 120 volumes long. Apollonius of Rhodes was the successor to Zenodotus. Eratosthenes of Cyrene succeeded Apollonius in 235 BCE and compiled his tetagmenos epi teis megaleis bibliothekeis, the 'scheme of the great bookshelves'. In 195 BCE Aristophanes of Byzantium, Eratosthenes' successor, was the librarian and updated the Pinakes, although it is also possible that his work was not a supplement of Callimachus' Pinakes themselves, but an independent polemic against, or commentary upon, their contents. == Description == The collection at the Library of Alexandria contained nearly 500,000 papyrus scrolls, which were grouped together by subject matter and stored in bins. Each bin carried a label with painted tablets hung above the stored papyri. Pinakes was named after these tablets and are a set of index lists. The bins gave bibliographical information for every roll. A typical entry started with a title and also provided the author's name, birthplace, father's name, any teachers trained under, and educational background. It contained a brief biography of the author and a list of the author's publications. The entry had the first line of the work, a summary of its contents, the name of the author, and information about the origin of the roll, as well as any doubts about the genuineness of the ascription. Callimachus' system divided works into six genres of poetry and five sections of prose: rhetoric, law, epic, tragedy, comedy, lyric poetry, history, medicine, mathematics, natural science, and miscellanies. Each category was alphabetized by author. Callimachus composed two other works that were referred as pinakes and were probably somewhat similar in format to the Pinakes (of which they "may or may not be subsections"), but were concerned with individual topics. These are listed by the Suda as: A Chronological Pinax and Description of Didaskaloi from the Beginning and Pinax of the Vocabulary and Treatises of Democritus. == Later bibliographic pinakes == The term pinax was used for bibliographic catalogs beyond Callimachus. For example, Ptolemy-el-Garib's catalog of Aristotle's writings comes to us with the title Pinax (catalog) of Aristotle's writings. == Legacy == The Pinakes proved indispensable to librarians for centuries, and they became a model for organizing knowledge throughout the Mediterranean. Their later influence can be traced to medieval times, even to the Arabic counterpart of the tenth century: Ibn al-Nadim's Al-Fihrist ("Index"). Local variations for cataloging and library classification continued through the late 19th century, when Anthony Panizzi and Melvil Dewey paved the way for more shared and standardized approaches.

Organoid intelligence

Organoid intelligence (OI) is an emerging field of study in computer science and biology that develops and studies biological wetware computing using 3D cultures of human brain cells (or brain organoids) and brain-machine interface technologies. Such technologies may be referred to as OIs or the nervous filesystem. Organoid intelligent computer systems can be an example of biohybrid systems. == Differences with non-organic computing == As opposed to traditional non-organic silicon-based approaches, OI seeks to use lab-grown cerebral organoids to serve as "biological hardware". While these structures are still far from being able to think like a regular human brain and do not yet possess strong computing capabilities, OI research currently offers the potential to improve the understanding of brain development, learning and memory, potentially finding treatments for neurological disorders such as dementia. Thomas Hartung, a professor from Johns Hopkins University, argued in 2023 that "while silicon-based computers are certainly better with numbers, brains are better at learning." He noted that transistor density in computer chip may be approaching its limits, whereas brains, being wired differently, are more energy-efficient and can store large amounts of information. Some researchers claim that even though human brains are slower than machines at processing simple information, they are far better at processing complex information as brains can deal with fewer and more uncertain data, perform both sequential and parallel processing, being highly heterogenous, use incomplete datasets, and is said to outperform non-organic machines in decision-making. Training OIs involve the process of biological learning (BL) as opposed to machine learning (ML) for AIs. == Bioinformatics in OI == OI generates complex biological data, necessitating sophisticated methods for processing and analysis. Bioinformatics provides the tools and techniques to decipher raw data, uncovering the patterns and insights. Researchers have developed a platform named Neuroplatform for experimenting remotely with brain organoids via an API. == Intended functions == Brain-inspired computing hardware aims to emulate the structure and working principles of the brain and could be used to address current limitations in AI technologies. However, brain-inspired silicon chips are still limited in their ability to fully mimic brain function, as most examples are built on digital electronic principles. One study performed OI computation (which they termed Brainoware) by sending and receiving information from the brain organoid using a high-density multielectrode array. By applying spatiotemporal electrical stimulation, nonlinear dynamics, and fading memory properties, as well as unsupervised learning from training data by reshaping the organoid functional connectivity, the study showed the potential of this technology by using it for speech recognition and nonlinear equation prediction in a reservoir computing framework. == Ethical concerns == While researchers are hoping to use OI and biological computing to complement traditional silicon-based computing, there are also questions about the ethics of such an approach. Concerns include the possibility that an organoid could develop sentience or consciousness, and the question of the relationship between a stem cell donor (for growing the organoid) and the respective OI system.

Constellation model

The constellation model is a probabilistic, generative model for category-level object recognition in computer vision. Like other part-based models, the constellation model attempts to represent an object class by a set of N parts under mutual geometric constraints. Because it considers the geometric relationship between different parts, the constellation model differs significantly from appearance-only, or "bag-of-words" representation models, which explicitly disregard the location of image features. The problem of defining a generative model for object recognition is difficult. The task becomes significantly complicated by factors such as background clutter, occlusion, and variations in viewpoint, illumination, and scale. Ideally, we would like the particular representation we choose to be robust to as many of these factors as possible. In category-level recognition, the problem is even more challenging because of the fundamental problem of intra-class variation. Even if two objects belong to the same visual category, their appearances may be significantly different. However, for structured objects such as cars, bicycles, and people, separate instances of objects from the same category are subject to similar geometric constraints. For this reason, particular parts of an object such as the headlights or tires of a car still have consistent appearances and relative positions. The Constellation Model takes advantage of this fact by explicitly modeling the relative location, relative scale, and appearance of these parts for a particular object category. Model parameters are estimated using an unsupervised learning algorithm, meaning that the visual concept of an object class can be extracted from an unlabeled set of training images, even if that set contains "junk" images or instances of objects from multiple categories. It can also account for the absence of model parts due to appearance variability, occlusion, clutter, or detector error. == History == The idea for a "parts and structure" model was originally introduced by Fischler and Elschlager in 1973. This model has since been built upon and extended in many directions. The Constellation Model, as introduced by Dr. Perona and his colleagues, was a probabilistic adaptation of this approach. In the late '90s, Burl et al. revisited the Fischler and Elschlager model for the purpose of face recognition. In their work, Burl et al. used manual selection of constellation parts in training images to construct a statistical model for a set of detectors and the relative locations at which they should be applied. In 2000, Weber et al. made the significant step of training the model using a more unsupervised learning process, which precluded the necessity for tedious hand-labeling of parts. Their algorithm was particularly remarkable because it performed well even on cluttered and occluded image data. Fergus et al. then improved upon this model by making the learning step fully unsupervised, having both shape and appearance learned simultaneously, and accounting explicitly for the relative scale of parts. == The method of Weber and Welling et al. == In the first step, a standard interest point detection method, such as Harris corner detection, is used to generate interest points. Image features generated from the vicinity of these points are then clustered using k-means or another appropriate algorithm. In this process of vector quantization, one can think of the centroids of these clusters as being representative of the appearance of distinctive object parts. Appropriate feature detectors are then trained using these clusters, which can be used to obtain a set of candidate parts from images. As a result of this process, each image can now be represented as a set of parts. Each part has a type, corresponding to one of the aforementioned appearance clusters, as well as a location in the image space. === Basic generative model === Weber & Welling here introduce the concept of foreground and background. Foreground parts correspond to an instance of a target object class, whereas background parts correspond to background clutter or false detections. Let T be the number of different types of parts. The positions of all parts extracted from an image can then be represented in the following "matrix," X o = ( x 11 , x 12 , ⋯ , x 1 N 1 x 21 , x 22 , ⋯ , x 2 N 2 ⋮ x T 1 , x T 2 , ⋯ , x T N T ) {\displaystyle X^{o}={\begin{pmatrix}x_{11},x_{12},{\cdots },x_{1N_{1}}\\x_{21},x_{22},{\cdots },x_{2N_{2}}\\\vdots \\x_{T1},x_{T2},{\cdots },x_{TN_{T}}\end{pmatrix}}} where N i {\displaystyle N_{i}\,} represents the number of parts of type i ∈ { 1 , … , T } {\displaystyle i\in \{1,\dots ,T\}} observed in the image. The superscript o indicates that these positions are observable, as opposed to missing. The positions of unobserved object parts can be represented by the vector x m {\displaystyle x^{m}\,} . Suppose that the object will be composed of F {\displaystyle F\,} distinct foreground parts. For notational simplicity, we assume here that F = T {\displaystyle F=T\,} , though the model can be generalized to F > T {\displaystyle F>T\,} . A hypothesis h {\displaystyle h\,} is then defined as a set of indices, with h i = j {\displaystyle h_{i}=j\,} , indicating that point x i j {\displaystyle x_{ij}\,} is a foreground point in X o {\displaystyle X^{o}\,} . The generative probabilistic model is defined through the joint probability density p ( X o , x m , h ) {\displaystyle p(X^{o},x^{m},h)\,} . === Model details === The rest of this section summarizes the details of Weber & Welling's model for a single component model. The formulas for multiple component models are extensions of those described here. To parametrize the joint probability density, Weber & Welling introduce the auxiliary variables b {\displaystyle b\,} and n {\displaystyle n\,} , where b {\displaystyle b\,} is a binary vector encoding the presence/absence of parts in detection ( b i = 1 {\displaystyle b_{i}=1\,} if h i > 0 {\displaystyle h_{i}>0\,} , otherwise b i = 0 {\displaystyle b_{i}=0\,} ), and n {\displaystyle n\,} is a vector where n i {\displaystyle n_{i}\,} denotes the number of background candidates included in the i t h {\displaystyle i^{th}} row of X o {\displaystyle X^{o}\,} . Since b {\displaystyle b\,} and n {\displaystyle n\,} are completely determined by h {\displaystyle h\,} and the size of X o {\displaystyle X^{o}\,} , we have p ( X o , x m , h ) = p ( X o , x m , h , n , b ) {\displaystyle p(X^{o},x^{m},h)=p(X^{o},x^{m},h,n,b)\,} . By decomposition, p ( X o , x m , h , n , b ) = p ( X o , x m | h , n , b ) p ( h | n , b ) p ( n ) p ( b ) {\displaystyle p(X^{o},x^{m},h,n,b)=p(X^{o},x^{m}|h,n,b)p(h|n,b)p(n)p(b)\,} The probability density over the number of background detections can be modeled by a Poisson distribution, p ( n ) = ∏ i = 1 T 1 n i ! ( M i ) n i e − M i {\displaystyle p(n)=\prod _{i=1}^{T}{\frac {1}{n_{i}!}}(M_{i})^{n_{i}}e^{-M_{i}}} where M i {\displaystyle M_{i}\,} is the average number of background detections of type i {\displaystyle i\,} per image. Depending on the number of parts F {\displaystyle F\,} , the probability p ( b ) {\displaystyle p(b)\,} can be modeled either as an explicit table of length 2 F {\displaystyle 2^{F}\,} , or, if F {\displaystyle F\,} is large, as F {\displaystyle F\,} independent probabilities, each governing the presence of an individual part. The density p ( h | n , b ) {\displaystyle p(h|n,b)\,} is modeled by p ( h | n , b ) = { 1 ∏ f = 1 F N f b f , if h ∈ H ( b , n ) 0 , for other h {\displaystyle p(h|n,b)={\begin{cases}{\frac {1}{\textstyle \prod _{f=1}^{F}N_{f}^{b_{f}}}},&{\mbox{if }}h\in H(b,n)\\0,&{\mbox{for other }}h\end{cases}}} where H ( b , n ) {\displaystyle H(b,n)\,} denotes the set of all hypotheses consistent with b {\displaystyle b\,} and n {\displaystyle n\,} , and N f {\displaystyle N_{f}\,} denotes the total number of detections of parts of type f {\displaystyle f\,} . This expresses the fact that all consistent hypotheses, of which there are ∏ f = 1 F N f b f {\displaystyle \textstyle \prod _{f=1}^{F}N_{f}^{b_{f}}} , are equally likely in the absence of information on part locations. And finally, p ( X o , x m | h , n ) = p f g ( z ) p b g ( x b g ) {\displaystyle p(X^{o},x^{m}|h,n)=p_{fg}(z)p_{bg}(x_{bg})\,} where z = ( x o x m ) {\displaystyle z=(x^{o}x^{m})\,} are the coordinates of all foreground detections, observed and missing, and x b g {\displaystyle x_{bg}\,} represents the coordinates of the background detections. Note that foreground detections are assumed to be independent of the background. p f g ( z ) {\displaystyle p_{fg}(z)\,} is modeled as a joint Gaussian with mean μ {\displaystyle \mu \,} and covariance Σ {\displaystyle \Sigma \,} . === Classification === The ultimate objective of this model is to classify images into classes "object present" (class C 1 {\displaystyle C_{1}\,} ) and "object absent" (class C 0 {\displaystyle C_{0}\,} ) given t

Ground truth

Ground truth is information that is known to be real or true, provided by direct observation and measurement (i.e. empirical evidence) as opposed to information provided by inference. The term ground truth appeared in remote sensing literature as early as 1972, when NASA described it as essential "data about ... materials on the earth's surface" used to calibrate measurements. It was later adopted by the statistical modeling and machine learning communities. == Etymology == The Oxford English Dictionary (s.v. ground truth) records the use of the word Groundtruth in the sense of 'fundamental truth' from Henry Ellison's poem "The Siberian Exile's Tale", published in 1833. == Usage == The term "ground truth" can be used as a noun, adjective, and verb. Noun: "ground truth" (no hyphen). Example: "The ground truth is essential for training accurate models." Adjective: "ground-truth" (hyphenated compound adjective). Example: "We need to use ground-truth data to validate the model." Verb: "to ground-truth" or "to groundtruth" (compound verb,). Example: "We need to ground-truth the results to ensure their accuracy." == Statistics and machine learning == In statistics and machine learning, ground truth is the ideal expected result, used in statistical models to prove or disprove research hypotheses. "Ground truthing" is the process of gathering the good data for this test. Ground truth is typically included in labeled data. In machine learning, "ground truth" is not necessarily objectively correct or true. For example, in training AI models or relevance rankers, it may be a set of judgments made by people or inferred from user behavior, which may depend on context. For example, in Bayesian spam filtering, a supervised learning system is typically trained by examples labeled as spam and non-spam. Although these labels may be subjective or inaccurate, they are considered ground truth. True ground truth in machine learning is objective data. For example, suppose we are testing a stereo vision system to see how well it can estimate 3D positions. A calibrated laser rangefinder may provide accurate distances as ground truth. == Remote sensing == In remote sensing, "ground truth" refers to information collected at the imaged location. Ground truth allows image data to be related to real features and materials on the ground. The collection of ground truth data enables calibration of remote-sensing data, and aids in the interpretation and analysis of what is being sensed. Examples include cartography, meteorology, analysis of aerial photographs, satellite imagery and other techniques in which data are gathered at a distance. More specifically, ground truth may refer to a process in which "pixels" on a satellite image are compared to what is imaged (at the time of capture) in order to verify the contents of the "pixels" in the image (noting that the concept of "pixel" is imaging-system-dependent). In the case of a classified image, supervised classification can help to determine the accuracy of the classification by the remote sensing system which can minimize error in the classification. Ground truth is usually done on site, correlating what is known with surface observations and measurements of various properties of the features of the ground resolution cells under study in the remotely sensed digital image. The process also involves taking geographic coordinates of the ground resolution cell with GPS technology and comparing those with the coordinates of the "pixel" being studied provided by the remote sensing software to understand and analyze the location errors and how it may affect a particular study. Ground truth is important in the initial supervised classification of an image. When the identity and location of land cover types are known through a combination of field work, maps, and personal experience these areas are known as training sites. The spectral characteristics of these areas are used to train the remote sensing software using decision rules for classifying the rest of the image. These decision rules such as Maximum Likelihood Classification, Parallelopiped Classification, and Minimum Distance Classification offer different techniques to classify an image. Additional ground truth sites allow the remote sensor to establish an error matrix that validates the accuracy of the classification method used. Different classification methods may have different percentages of error for a given classification project. It is important that the remote sensor chooses a classification method that works best with the number of classifications used while providing the least amount of error. Ground truth also helps with atmospheric correction. Since images from satellites have to pass through the atmosphere, they can get distorted because of absorption in the atmosphere. So ground truth can help fully identify objects in satellite photos. === Errors of commission === An example of an error of commission is when a pixel reports the presence of a feature (such a tree) that, in reality, is absent (no tree is actually present). Ground truthing ensures that the error matrices have a higher accuracy percentage than would be the case if no pixels were ground-truthed. This value is the complement of the user's accuracy, i.e. Commission Error = 1 - user's accuracy. === Errors of omission === An example of an error of omission is when pixels of a certain type, for example, maple trees, are not classified as maple trees. The process of ground-truthing helps to ensure that the pixel is classified correctly and the error matrices are more accurate. This value is the complement of the producer's accuracy, i.e. Omission Error = 1 - producer's accuracy == Geographical information systems == In GIS the spatial data is modeled as field (like in remote sensing raster images) or as object (like in vectorial map representation). They are modeled from the real world (also named geographical reality), typically by a cartographic process (illustrated). Geographic information systems such as GIS, GPS, and GNSS, have become so widespread that the term "ground truth" has taken on special meaning in that context. If the location coordinates returned by a location method such as GPS are an estimate of a location, then the "ground truth" is the actual location on Earth. A smart phone might return a set of estimated location coordinates such as 43.87870, −103.45901. The ground truth being estimated by those coordinates is the tip of George Washington's nose on Mount Rushmore. The accuracy of the estimate is the maximum distance between the location coordinates and the ground truth. We could say in this case that the estimate accuracy is 10 meters, meaning that the point on Earth represented by the location coordinates is thought to be within 10 meters of George's nose—the ground truth. In slang, the coordinates indicate where we think George Washington's nose is located, and the ground truth is where it really is. In practice a smart phone or hand-held GPS unit is routinely able to estimate the ground truth within 6–10 meters. Specialized instruments can reduce GPS measurement error to under a centimeter. == Military usage == US military slang uses "ground truth" to refer to the facts comprising a tactical situation—as opposed to intelligence reports, mission plans, and other descriptions reflecting the conative or policy-based projections of the industrial·military complex. The term appears in the title of the Iraq War documentary film The Ground Truth (2006), and also in military publications, for example Stars and Stripes saying: "Stripes decided to figure out what the ground truth was in Iraq."

Random projection

In mathematics and statistics, random projection is a technique used to reduce the dimensionality of a set of points which lie in Euclidean space. According to theoretical results, random projection preserves distances well, but empirical results are sparse. They have been applied to many natural language tasks under the name random indexing. == Dimensionality reduction == Dimensionality reduction, as the name suggests, is reducing the number of random variables using various mathematical methods from statistics and machine learning. Dimensionality reduction is often used to reduce the problem of managing and manipulating large data sets. Dimensionality reduction techniques generally use linear transformations in determining the intrinsic dimensionality of the manifold as well as extracting its principal directions. For this purpose there are various related techniques, including: principal component analysis, linear discriminant analysis, canonical correlation analysis, discrete cosine transform, random projection, etc. Random projection is a simple and computationally efficient way to reduce the dimensionality of data by trading a controlled amount of error for faster processing times and smaller model sizes. The dimensions and distribution of random projection matrices are controlled so as to approximately preserve the pairwise distances between any two samples of the dataset. == Method == The core idea behind random projection is given in the Johnson-Lindenstrauss lemma, which states that if points in a vector space are of sufficiently high dimension, then they may be projected into a suitable lower-dimensional space in a way which approximately preserves pairwise distances between the points with high probability. In random projection, the original d {\displaystyle d} -dimensional data is projected to a k {\displaystyle k} -dimensional subspace, by multiplying on the left by a random matrix R ∈ R k × d {\displaystyle R\in \mathbb {R} ^{k\times d}} . Using matrix notation: If X d × N {\displaystyle X_{d\times N}} is the original set of N d-dimensional observations, then X k × N R P = R k × d X d × N {\displaystyle X_{k\times N}^{RP}=R_{k\times d}X_{d\times N}} is the projection of the data onto a lower k-dimensional subspace. Random projection is computationally simple: form the random matrix "R" and project the d × N {\displaystyle d\times N} data matrix X onto K dimensions of order O ( d k N ) {\displaystyle O(dkN)} . If the data matrix X is sparse with about c nonzero entries per column, then the complexity of this operation is of order O ( c k N ) {\displaystyle O(ckN)} . === Orthogonal random projection === A unit vector can be orthogonally projected to a random subspace. Let u {\displaystyle u} be the original unit vector, and let v {\displaystyle v} be its projection. The norm-squared ‖ v ‖ 2 2 {\displaystyle \|v\|_{2}^{2}} has the same distribution as projecting a random point, uniformly sampled on the unit sphere, to its first k {\displaystyle k} coordinates. This is equivalent to sampling a random point in the multivariate gaussian distribution x ∼ N ( 0 , I d × d ) {\displaystyle x\sim {\mathcal {N}}(0,I_{d\times d})} , then normalizing it. Therefore, ‖ v ‖ 2 2 {\displaystyle \|v\|_{2}^{2}} has the same distribution as ∑ i = 1 k x i 2 ∑ i = 1 k x i 2 + ∑ i = k + 1 d x i 2 {\displaystyle {\frac {\sum _{i=1}^{k}x_{i}^{2}}{\sum _{i=1}^{k}x_{i}^{2}+\sum _{i=k+1}^{d}x_{i}^{2}}}} , which by the chi-squared construction of the Beta distribution, has distribution Beta ⁡ ( k / 2 , ( d − k ) / 2 ) {\displaystyle \operatorname {Beta} (k/2,(d-k)/2)} , with mean k / d {\displaystyle k/d} . We have a concentration inequality P r [ | ‖ v ‖ 2 − k d | ≥ ϵ k d ] ≤ 3 exp ⁡ ( − k ϵ 2 / 64 ) {\displaystyle Pr\left[\left|\|v\|_{2}-{\frac {k}{d}}\right|\geq \epsilon {\sqrt {\frac {k}{d}}}\right]\leq 3\exp \left(-k\epsilon ^{2}/64\right)} for any ϵ ∈ ( 0 , 1 ) {\displaystyle \epsilon \in (0,1)} . === Gaussian random projection === The random matrix R can be generated using a Gaussian distribution. The first row is a random unit vector uniformly chosen from S d − 1 {\displaystyle S^{d-1}} . The second row is a random unit vector from the space orthogonal to the first row, the third row is a random unit vector from the space orthogonal to the first two rows, and so on. In this way of choosing R, and the following properties are satisfied: Spherical symmetry: For any orthogonal matrix A ∈ O ( d ) {\displaystyle A\in O(d)} , RA and R have the same distribution. Orthogonality: The rows of R are orthogonal to each other. Normality: The rows of R are unit-length vectors. === More computationally efficient random projections === Achlioptas has shown that the random matrix can be sampled more efficiently. Either the full matrix can be sampled IID according to R i , j = 3 / k × { + 1 with probability 1 6 0 with probability 2 3 − 1 with probability 1 6 {\displaystyle R_{i,j}={\sqrt {3/k}}\times {\begin{cases}+1&{\text{with probability }}{\frac {1}{6}}\\0&{\text{with probability }}{\frac {2}{3}}\\-1&{\text{with probability }}{\frac {1}{6}}\end{cases}}} or the full matrix can be sampled IID according to R i , j = 1 / k × { + 1 with probability 1 2 − 1 with probability 1 2 {\displaystyle R_{i,j}={\sqrt {1/k}}\times {\begin{cases}+1&{\text{with probability }}{\frac {1}{2}}\\-1&{\text{with probability }}{\frac {1}{2}}\end{cases}}} Both are efficient for database applications because the computations can be performed using integer arithmetic. More related study is conducted in. It was later shown how to use integer arithmetic while making the distribution even sparser, having very few nonzeroes per column, in work on the Sparse JL Transform. This is advantageous since a sparse embedding matrix means being able to project the data to lower dimension even faster. === Random Projection with Quantization === Random projection can be further condensed by quantization (discretization), with 1-bit (sign random projection) or multi-bits. It is the building block of SimHash, RP tree, and other memory efficient estimation and learning methods. == Large quasiorthogonal bases == The Johnson-Lindenstrauss lemma states that large sets of vectors in a high-dimensional space can be linearly mapped in a space of much lower (but still high) dimension n with approximate preservation of distances. One of the explanations of this effect is the exponentially high quasiorthogonal dimension of n-dimensional Euclidean space. There are exponentially large (in dimension n) sets of almost orthogonal vectors (with small value of inner products) in n–dimensional Euclidean space. This observation is useful in indexing of high-dimensional data. Quasiorthogonality of large random sets is important for methods of random approximation in machine learning. In high dimensions, exponentially large numbers of randomly and independently chosen vectors from equidistribution on a sphere (and from many other distributions) are almost orthogonal with probability close to one. This implies that in order to represent an element of such a high-dimensional space by linear combinations of randomly and independently chosen vectors, it may often be necessary to generate samples of exponentially large length if we use bounded coefficients in linear combinations. On the other hand, if coefficients with arbitrarily large values are allowed, the number of randomly generated elements that are sufficient for approximation is even less than dimension of the data space. == Implementations == RandPro - An R package for random projection sklearn.random_projection - A module for random projection from the scikit-learn Python library Weka implementation [1]

Microsoft Azure

Microsoft Azure, sometimes stylized Azure, and formerly Windows Azure, is the cloud computing platform developed by Microsoft. It offers management, access and development of applications and services to individuals, companies, and governments through its global infrastructure. Microsoft Azure supports many programming languages, tools, and frameworks, including Microsoft-specific and third-party software and systems. Azure was first introduced at the Professional Developers Conference (PDC) in October 2008 under the codename "Project Red Dog". It was officially launched as Windows Azure in February 2010 and later renamed to Microsoft Azure on March 25, 2014. == Services == Microsoft Azure uses large-scale virtualization at Microsoft data centers worldwide and offers more than 600 services. Microsoft Azure offers a service level agreement (SLA) that guarantees 99.9% availability for applications and data hosted on its platform, subject to specific terms and conditions outlined in the SLA documentation. === Computer services === Virtual machines, infrastructure as a service (IaaS), allowing users to launch general-purpose Microsoft Windows and Linux virtual machines, software as a service (SaaS), as well as preconfigured machine images for popular software packages. Starting in 2022, these virtual machines are now powered by Ampere Cloud-native processors. Most users run Linux on Azure, some of the many Linux distributions offered, including Microsoft's own Linux-based Azure Sphere. App services, platform as a service (PaaS) environment, letting developers easily publish and manage websites. Azure Web Sites allows developers to build sites using ASP.NET, PHP, Node.js, Java, or Python, which can be deployed using FTP, Git, Mercurial, Azure DevOps, or uploaded through the user portal. This feature was announced in preview form in June 2012 at the Meet Microsoft Azure event. Customers can create websites in PHP, ASP.NET, Node.js, or Python, or select from several open-source applications from a gallery to deploy. This comprises one aspect of the platform as a service (PaaS) offerings for the Microsoft Azure Platform. It was renamed Web Apps in April 2015. Web Jobs are applications that can be deployed to an App Service environment to implement background processing that can be invoked on a schedule, on-demand, or run continuously. The Blob, Table, and Queue services can be used to communicate between Web Apps and Web Jobs and to provide state. Azure Kubernetes Service (AKS) provides the capability to deploy production-ready Kubernetes clusters in Azure. In July 2023, watermarking support on Azure Virtual Desktop was announced as an optional feature of Screen Capture to provide additional security against data leakage. === Identity === Entra ID connect is used to synchronize on-premises directories and enable SSO (Single Sign On). Entra ID B2C allows the use of consumer identity and access management in the cloud. Entra Domain Services is used to join Azure virtual machines to a domain without domain controllers. Azure information protection can be used to protect sensitive information. Entra ID External Identities is a set of capabilities that allow organizations to collaborate with external users, including customers and partners. On July 11, 2023, Microsoft announced the renaming of Azure AD to Microsoft Entra ID. The name change took place four days later. === Mobile services === Mobile Engagement collects real-time analytics that highlight users' behavior. It also provides push notifications to mobile devices. HockeyApp can be used to develop, distribute, and beta-test mobile apps. === Storage services === Storage Services provides REST and SDK APIs for storing and accessing data on the cloud. Table Service lets programs store structured text in partitioned collections of entities that are accessed by the partition key and primary key. Azure Table Service is a NoSQL non-relational database. Blob Service allows programs to store unstructured text and binary data as object storage blobs that can be accessed by an HTTP(S) path. Blob service also provides security mechanisms to control access to data. Queue Service lets programs communicate asynchronously by message using queues. File Service allows storing and access of data on the cloud using the REST APIs or the SMB protocol. === Communication services === Azure Communication Services offers an SDK for creating web and mobile communications applications that include SMS, video calling, VOIP and PSTN calling, and web-based chat. === Data management === Azure Data Explorer provides big data analytics and data-exploration capabilities. Azure Search provides text search and a subset of OData's structured filters using REST or SDK APIs. Cosmos DB is a NoSQL database service that implements a subset of the SQL SELECT statement on JSON documents. Azure Cache for Redis is a managed implementation of Redis. StorSimple manages storage tasks between on-premises devices and cloud storage. Azure SQL Database works to create, scale, and extend applications into the cloud using Microsoft SQL Server technology. It also integrates with Active Directory, Microsoft System Center, and Hadoop. Azure Synapse Analytics is a fully managed cloud data warehouse. Azure Data Factory is a data integration service that allows creation of data-driven workflows in the cloud for orchestrating and automating data movement and data transformation. Azure Data Lake is a scalable data storage and analytic service for big data analytics workloads that require developers to run massively parallel queries. Azure HDInsight is a big data-relevant service that deploys Hortonworks Hadoop on Microsoft Azure and supports the creation of Hadoop clusters using Linux with Ubuntu. Azure Stream Analytics is a Serverless scalable event-processing engine that enables users to develop and run real-time analytics on multiple streams of data from sources such as devices, sensors, websites, social media, and other applications. === Messaging === The Microsoft Azure Service Bus allows applications running on Azure premises or off-premises devices to communicate with Azure. This helps to build scalable and reliable applications in a service-oriented architecture (SOA). The Azure service bus supports four different types of communication mechanisms: Event Hubs, which provides event and telemetry ingress to the cloud at a massive scale, with low latency and high reliability. For example, an event hub can be used to track data from cell phones such as coordinating with a GPS in real time. Queues, which allows one-directional communication. A sender application would send the message to the service bus queue and a receiver would read from the queue. Though there can be multiple readers for the queue, only one would process a single message. Topics, which provides one-directional communication using a subscriber pattern. It is similar to a queue; however, each subscriber will receive a copy of the message sent to a Topic. Optionally, the subscriber can filter out messages based on specific criteria defined by the subscriber. Relays, which provides bi-directional communication. Unlike queues and topics, a relay does not store in-flight messages in its memory; instead, it just passes them on to the destination application. === Media services === A PaaS offering that can be used for encoding, content protection, streaming, or analytics. === CDN === Azure has a worldwide content delivery network (CDN) designed to efficiently deliver audio, video, applications, images, and other static files. It improves the performance of websites by caching static files closer to users, based on their geographic location. Users can manage the network using a REST-based HTTP API. Azure has 118 point-of-presence locations across 100 cities worldwide (also known as Edge locations) as of January 2023. === Developer === Application Insights Azure DevOps === Management === With Azure Automation, users can easily automate repetitive and time-consuming tasks, often prone to cloud or enterprise setting errors. They can accomplish it using runbooks or desired state configurations for process automation. Microsoft SMA === Azure AI === Microsoft Azure Machine Learning (Azure ML) provides tools and frameworks for developers to create their own machine learning and artificial intelligence (AI) services. Azure AI Services by Microsoft comprises prebuilt APIs, SDKs, and services developers can customize. These services encompass perceptual and cognitive intelligence features such as speech recognition, speaker recognition, neural speech synthesis, face recognition, computer vision, OCR/form understanding, natural language processing, machine translation, and business decision services. Many AI characteristics in Microsoft's products and services, namely Bing, Office, Teams, Xbox, and Windows, are driven by Azure AI Services. Microsoft Foundry (formerly known as Azure AI Studio)

Density-based clustering validation

Density-Based Clustering Validation (DBCV) is a metric designed to assess the quality of clustering solutions, particularly for density-based clustering algorithms like DBSCAN, Mean shift, and OPTICS. This metric is particularly suited for identifying concave and nested clusters, where traditional metrics such as the Silhouette coefficient, Davies–Bouldin index, or Calinski–Harabasz index often struggle to provide meaningful evaluations. Unlike traditional validation measures, which often rely on compact and well-separated clusters, DBCV index evaluates how well clusters are defined in terms of local density variations and structural coherence. This metric was introduced in 2014 by David Moulavi and colleagues in their work. It utilizes density connectivity principles to quantify clustering structures, making it especially effective at detecting arbitrarily shaped clusters in concave datasets, where traditional metrics may be less reliable. The DBCV index has been employed for clustering analysis in bioinformatics, ecology, techno-economy, and health informatics , as well as in numerous other fields. == Definition == DBCV index evaluates clustering structures by analyzing the relationships between data points within and across clusters. Given a dataset X = x 1 , x 2 , . . . , x n {\displaystyle X={x_{1},x_{2},...,x_{n}}} , a density-based algorithm partitions it into K clusters C 1 , C 2 , . . . , C K {\displaystyle {C_{1},C_{2},...,C_{K}}} . Each point x i {\displaystyle x_{i}} belongs to a specific cluster, denoted as C c l u s t e r ( x i ) {\displaystyle C_{cluster(x_{i})}} A key concept in DBCV index is the notion of density-connected paths. Two points within the same cluster are considered density-connected if there exists a sequence of intermediate points linking them, where each consecutive pair meets a predefined density criterion. The density-based distance between two points is determined by identifying the optimal path that minimizes the maximum local reachability distance along its trajectory. DBCV index extends the Silhouette coefficient by redefining cluster cohesion and separation using density-based distances: Within-cluster density distance measures how closely a point is related to other members of its cluster: a i = 1 | C c l u s t e r ( x i ) | − 1 ∑ x j ∈ C c l u s t e r ( x i ) , y ≠ x d d e n s i t y ( x j , x i ) {\displaystyle a_{i}={\frac {1}{|C_{cluster(x_{i})}|-1}}\sum _{x_{j}\in C_{cluster(x_{i})},y\neq x}d_{density}(x_{j},x_{i})} Nearest-cluster density distance quantifies how far a point is from the closest external cluster: b i = min C ≠ C cluster ( x i ) C ∈ { C 1 , … , C K } ( 1 | C | ∑ x j ∈ C d density ( x i , x j ) ) . {\displaystyle b_{i}=\min _{C\neq C_{{\text{cluster}}(x_{i})} \atop C\in \{C_{1},\dots ,C_{K}\}}\left({\frac {1}{|C|}}\sum _{x_{j}\in C}d_{\text{density}}(x_{i},x_{j})\right).} Using these measures, the DBCV index is computed as: D B C V = 1 n ∑ i = 1 n b i − a i max ( a i , b i ) {\displaystyle DBCV={\frac {1}{n}}\sum _{i=1}^{n}{\frac {b_{i}-a_{i}}{\max(a_{i},b_{i})}}} == Explanation == DBCV index values range between −1 and +1: +1: Strongly cohesive and well-separated clusters. 0: Ambiguous clustering structure. −1: Poorly formed clusters or incorrect assignments. By leveraging density-based distances instead of traditional Euclidean measures, DBCV index provides a more robust evaluation of clustering performance in datasets with irregular or non-spherical distributions.