Nearest neighbor search

Nearest neighbor search

Nearest neighbor search (NNS), as a form of proximity search, is the optimization problem of finding the point in a given set that is closest (or most similar) to a given point. Closeness is typically expressed in terms of a dissimilarity function: the less similar the objects, the larger the function values. Formally, the nearest neighbor (NN) search problem is defined as follows: given a set S of points in a space M and a query point q ∈ M {\displaystyle q\in M} , find the closest point in S to q. Donald Knuth in volume 3 of The Art of Computer Programming (1973) called it the post-office problem, referring to an application of assigning to a residence the nearest post office. A direct generalization of this problem is a k-NN search, where we need to find the k closest points. Most commonly M is a metric space and dissimilarity is expressed as a distance metric, which is symmetric and satisfies the triangle inequality. Even more common, M is taken to be the d-dimensional vector space where dissimilarity is measured using the Euclidean distance, Manhattan distance or other distance metric. However, the dissimilarity function can be arbitrary. One example is asymmetric Bregman divergence, for which the triangle inequality does not hold. == Applications == The nearest neighbor search problem arises in numerous fields of application, including: Pattern recognition – in particular for optical character recognition Statistical classification – see k-nearest neighbor algorithm Computer vision – for point cloud registration Computational geometry – see Closest pair of points problem Cryptanalysis – for lattice problem Databases – e.g. content-based image retrieval Coding theory – see maximum likelihood decoding Semantic search Vector databases, where nearest-neighbor lookup over embeddings is used to retrieve semantically similar records Retrieval-augmented generation systems, where nearest-neighbor retrieval over embeddings is used to fetch candidate passages or documents before generation Data compression – see MPEG-2 standard Robotic sensing Recommendation systems, e.g. see Collaborative filtering Internet marketing – see contextual advertising and behavioral targeting DNA sequencing Spell checking – suggesting correct spelling Plagiarism detection Similarity scores for predicting career paths of professional athletes. Cluster analysis – assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense, usually based on Euclidean distance Chemical similarity Sampling-based motion planning == Methods == Various solutions to the NNS problem have been proposed. The quality and usefulness of the algorithms are determined by the time complexity of queries as well as the space complexity of any search data structures that must be maintained. The informal observation usually referred to as the curse of dimensionality states that there is no general-purpose exact solution for NNS in high-dimensional Euclidean space using polynomial preprocessing and polylogarithmic search time. === Exact methods === ==== Linear search ==== The simplest solution to the NNS problem is to compute the distance from the query point to every other point in the database, keeping track of the "best so far". This algorithm, sometimes referred to as the naive approach, has a running time of O(dN), where N is the cardinality of S and d is the dimensionality of S. There are no search data structures to maintain, so the linear search has no space complexity beyond the storage of the database. Naive search can, on average, outperform space partitioning approaches on higher dimensional spaces. The absolute distance is not required for distance comparison, only the relative distance. In geometric coordinate systems the distance calculation can be sped up considerably by omitting the square root calculation from the distance calculation between two coordinates. The distance comparison will still yield identical results. ==== Space partitioning ==== Since the 1970s, the branch and bound methodology has been applied to the problem. In the case of Euclidean space, this approach encompasses spatial index or spatial access methods. Several space-partitioning methods have been developed for solving the NNS problem. Perhaps the simplest is the k-d tree, which iteratively bisects the search space into two regions containing half of the points of the parent region. Queries are performed via traversal of the tree from the root to a leaf by evaluating the query point at each split. Depending on the distance specified in the query, neighboring branches that might contain hits may also need to be evaluated. For constant dimension query time, average complexity is O(log N) in the case of randomly distributed points, worst case complexity is O(kN^(1-1/k)) Alternatively the R-tree data structure was designed to support nearest neighbor search in dynamic context, as it has efficient algorithms for insertions and deletions such as the R tree. R-trees can yield nearest neighbors not only for Euclidean distance, but can also be used with other distances. In the case of general metric space, the branch-and-bound approach is known as the metric tree approach. Particular examples include vp-tree and BK-tree methods. Using a set of points taken from a 3-dimensional space and put into a BSP tree, and given a query point taken from the same space, a possible solution to the problem of finding the nearest point-cloud point to the query point is given in the following description of an algorithm. (Strictly speaking, no such point may exist, because it may not be unique. But in practice, usually we only care about finding any one of the subset of all point-cloud points that exist at the shortest distance to a given query point.) The idea is, for each branching of the tree, guess that the closest point in the cloud resides in the half-space containing the query point. This may not be the case, but it is a good heuristic. After having recursively gone through all the trouble of solving the problem for the guessed half-space, now compare the distance returned by this result with the shortest distance from the query point to the partitioning plane. This latter distance is that between the query point and the closest possible point that could exist in the half-space not searched. If this distance is greater than that returned in the earlier result, then clearly there is no need to search the other half-space. If there is such a need, then you must go through the trouble of solving the problem for the other half space, and then compare its result to the former result, and then return the proper result. The performance of this algorithm is nearer to logarithmic time than linear time when the query point is near the cloud, because as the distance between the query point and the closest point-cloud point nears zero, the algorithm needs only perform a look-up using the query point as a key to get the correct result. === Approximation methods === An approximate nearest neighbor search algorithm is allowed to return points whose distance from the query is at most c {\displaystyle c} times the distance from the query to its nearest points. The appeal of this approach is that, in many cases, an approximate nearest neighbor is almost as good as the exact one. In particular, if the distance measure accurately captures the notion of user quality, then small differences in the distance should not matter. ==== Greedy search in proximity neighborhood graphs ==== Proximity graph methods (such as navigable small world graphs and HNSW) are considered the current state-of-the-art for the approximate nearest neighbors search. The methods are based on greedy traversing in proximity neighborhood graphs G ( V , E ) {\displaystyle G(V,E)} in which every point x i ∈ S {\displaystyle x_{i}\in S} is uniquely associated with vertex v i ∈ V {\displaystyle v_{i}\in V} . The search for the nearest neighbors to a query q in the set S takes the form of searching for the vertex in the graph G ( V , E ) {\displaystyle G(V,E)} . The basic algorithm – greedy search – works as follows: search starts from an enter-point vertex v i ∈ V {\displaystyle v_{i}\in V} by computing the distances from the query q to each vertex of its neighborhood { v j : ( v i , v j ) ∈ E } {\displaystyle \{v_{j}:(v_{i},v_{j})\in E\}} , and then finds a vertex with the minimal distance value. If the distance value between the query and the selected vertex is smaller than the one between the query and the current element, then the algorithm moves to the selected vertex, and it becomes new enter-point. The algorithm stops when it reaches a local minimum: a vertex whose neighborhood does not contain a vertex that is closer to the query than the vertex itself. The idea of proximity neighborhood graphs was exploited in multiple publications, including the seminal paper by Arya and Mount, in the VoroNet syst

Cone tracing

Cone tracing and beam tracing are a derivative of the ray tracing algorithm that replaces rays, which have no thickness, with thick rays. == Principles == In ray tracing, rays are often modeled as geometric ray with no thickness to perform efficient geometric queries such as a ray-triangle intersection. From a physics of light transport point of view, however, this is an inaccurate model provided the pixel on the sensor plane has non-zero area. In the simplified pinhole camera optics model, the energy reaching the pixel comes from the integral of radiance from the solid angle by which the sensor pixel sees the scene through the pinhole at the focal plane. This yields the key notion of pixel footprint on surfaces or in the texture space, which is the back projection of the pixel on to the scene. Note that this approach can also represent a lens-based camera and thus depth of field effects, using a cone whose cross-section decreases from the lens size to zero at the focal plane, and then increases. Real optical system do not focus on exact points because of diffraction and imperfections. This can be modeled with a point spread function (PSF) weighted within a solid angle larger than the pixel. From a signal processing point of view, ignoring the point spread function and approximating the integral of radiance with a single, central sample (through a ray with no thickness) can lead to strong aliasing because the "projected geometric signal" has very high frequencies exceeding the Nyquist-Shannon maximal frequency that can be represented using the uniform pixel sampling rate. The physically based image formation model can be approximated by the convolution with the point spread function assuming the function is shift-invariant and linear. In practice, techniques such as multisample anti-aliasing estimate this cone-based model by oversampling the signal and then performing a convolution (the reconstruction filter). The backprojected cone footprint onto the scene can also be used to directly pre-filter the geometry and textures of the scene. Note that contrary to intuition, the reconstruction filter should not be the pixel footprint (as the pinhole camera model would suggest), since a box filter has poor spectral properties. Conversely, the ideal sinc function is not practical, having infinite support with possibly negative values which often creates ringing artifacts due to the Gibbs phenomenon. A Gaussian or a Lanczos filter are considered good compromises. == Computer graphics models == Cone and Beam early papers rely on different simplifications: the first considers a circular section and treats the intersection with various possible shapes. The second treats an accurate pyramidal beam through the pixel and along a complex path, but it only works for polyhedrical shapes. Cone tracing solves certain problems related to sampling and aliasing, which can plague conventional ray tracing. However, cone tracing creates a host of problems of its own. For example, just intersecting a cone with scene geometry leads to an enormous variety of possible results. For this reason, cone tracing has remained mostly unpopular. In recent years, increases in computer speed have made Monte Carlo algorithms like distributed ray tracing - i.e. stochastic explicit integration of the pixel - much more used than cone tracing because the results are exact provided enough samples are used. But the convergence is so slow that even in the context of off-line rendering a huge amount of time can be required to avoid noise. Differential cone-tracing, considering a differential angular neighborhood around a ray, avoids the complexity of exact geometry intersection but requires a LOD representation of the geometry and appearance of the objects. MIPmapping is an approximation of it limited to the integration of the surface texture within a cone footprint. Differential ray-tracing extends it to textured surfaces viewed through complex paths of cones reflected or refracted by curved surfaces. Raymarching methods over signed distance fields (SDFs) naturally allow easy use of cone-like tracing, at zero additional cost to the tracing, and both speeds up tracing and improves quality. Voxel cone tracing is a real-time algorithm that uses a hierarchical voxel representation of scene geometry, such as a sparse voxel octree, to support fast cone tracing for indirect illumination. This approach allows for the approximation of effects like glossy reflections and ambient occlusion at interactive framerates without the need for precomputation.

David Blei

David Meir Blei is a professor in the Statistics and Computer Science departments at Columbia University. Prior to fall 2014 he was an associate professor in the Department of Computer Science at Princeton University. His work is primarily in machine learning. == Research == His research interests include topic models and he was one of the original developers of latent Dirichlet allocation, along with Andrew Ng and Michael I. Jordan. As of June 18, 2020, his publications have been cited 109,821 times, giving him an h-index of 116. == Honors and awards == Blei received the ACM Infosys Foundation Award in 2013. (This award is given to a computer scientist under the age of 45. It has since been renamed the ACM Prize in Computing.) He was named Fellow of ACM "For contributions to the theory and practice of probabilistic topic modeling and Bayesian machine learning" in 2015.

Corpus-assisted discourse studies

Corpus-assisted discourse studies (abbr.: CADS) is related historically and methodologically to the discipline of corpus linguistics. The principal endeavor of corpus-assisted discourse studies is the investigation, and comparison of features of particular discourse types, integrating into the analysis the techniques and tools developed within corpus linguistics. These include the compilation of specialised corpora and analyses of word and word-cluster frequency lists, comparative keyword lists and, above all, concordances. A broader conceptualisation of corpus-assisted discourse studies would include any study that aims to bring together corpus linguistics and discourse analysis. Such research is often labelled as corpus-based or corpus-assisted discourse analysis, with the term CADS coined by a research group in Italy (Partington 2004) for a specific type of corpus-assisted discourse analysis (see the section 'in different countries' below). == Aims == Corpus-assisted discourse studies aim to uncover non-obvious meaning, that is, meaning which might not be readily available to naked-eye perusal. Much of what carries meaning in texts is not open to direct observation: “you cannot understand the world just by looking at it” (Stubbs [after Gellner 1959] 1996: 92). We use language “semi-automatically”, in the sense that speakers and writers make semi-conscious choices within the various complex overlapping systems of which language is composed, including those of transitivity, modality (Michael Halliday 1994), lexical sets (e.g. freedom, liberty, deliverance), modification, and so on. Authors themselves are, famously, generally unaware of all the meanings their texts convey. By combining the quantitative research approach, that is, statistical analysis of large amounts of the discourse in question - more precisely, large numbers of tokens of the discourse type under study contained in a corpus - with the more qualitative research approach typical of discourse analysis, that is, the close, detailed examination of particular stretches of discourse it may be possible to better understand the processes at play in the discourse type and to gain access to non-obvious meanings. Aims can differ in other types of corpus-based or corpus-assisted discourse analysis; but in general such studies combine quantitative and qualitative research and aim to shed light on discourses, registers, discourse patterns, etc., with the help of a corpus linguistic approach. Specific aims and techniques depend on the relevant project. == In different countries == In German-speaking countries: Pioneering work in corpus-based discourse analysis was conducted in Europe, in particular by Hardt-Mautner/Mautner (1995, 2000) and Stubbs (1996, 2001). CADS and other types of corpus-based discourse analysis are inspired by this important early work. In Italy: A considerable body of research has been conducted in Italy either by individual researchers or under the aegis of combined inter-university projects such as Newspool (Partington et al. 2004) and CorDis (Morley and Bayley eds, 2009). It has concentrated on political and media language, mainly because a nucleus of linguists in Italian universities work in Political Science faculties and are increasingly interested in the use of corpus techniques to conduct a particular type of sociopolitical discourse analysis, including the unearthing of noteworthy ideological metaphors and motifs in the language of political figures and institutions. Italian researchers also developed Modern diachronic corpus-assisted discourse studies (MD-CADS). This approach contrasts the language contained in comparable corpora from different but recent points in time in order to track changes in modern language usage but also social, cultural and political changes over modern times, as reflected - and shared among people - in language. It is this Italian body of research that makes most use of the label CADS. In the UK: Linguists in the UK tend to undertake corpus-based critical discourse analysis (CDA). CDA generally adopts a leftist political stance, focusing on the ways that social and political domination is reproduced by text and talk. This type of corpus-based research was originally associated with Lancaster University (Baker et al. 2008), but has spread more widely since. Such work typically studies the discourses around particular groups of people (e.g. Muslims, people with disabilities) or concepts/events (e.g. feminism, same-sex marriage). In Australia: Corpus-based discourse analysis is undertaken by a growing number of Australian researchers, most often on media texts. Some of this work aims to elucidate specific features of discourse types (news, social media, television series, etc.), while other work is rooted in the tradition of corpus-based critical discourse analysis. == Comparison with traditional corpus linguistics == Traditional corpus linguistics has, quite naturally, tended to privilege the quantitative approach. In the drive to produce more authentic dictionaries and grammars of a language, it has been characterised by the compilation of some very large corpora of heterogeneric discourse types in the desire to obtain an overview of the greatest quantity and variety of discourse types possible, in other words, of the chimerical but useful fiction called the “general language” (“general English”, “general Italian”, and so on). This has led to the construction of immensely valuable research tools such as the Bank of English and the British National Corpus. Some branches of corpus linguistics have also promoted an approach that is "corpus-driven", in which we need, grammatically speaking, a mental tabula rasa to free ourselves of the baleful prejudice exerted by traditional models and allow the data to speak entirely for itself. The aim of corpus-assisted discourse studies and related approaches is radically different. Here the aim of the exercise is to acquaint oneself as much as possible with the discourse type(s) in hand. Researchers typically engage with their corpus in a variety of ways. As well as via wordlists and concordancing, intuitions for further research can also arise from reading or watching or listening to parts of the data-set, a process which can help provide a feel for how things are done linguistically in the discourse-type being studied. Corpus-assisted discourse analysis is also typically characterised by the compilation of ad hoc specialised corpora, since very frequently there exists no previously available collection of the discourse type in question. Often, other corpora are utilized in the course of a study for purposes of comparison. These may include pre-existing corpora or may themselves need to be compiled by the researcher. In some sense, all work with corpora – just as all work with discourse - is properly comparative. Even when a single corpus is employed, it is used to test the data it contains against another body of data. This may consist of the researcher's intuitions, or the data found in reference works such as dictionaries and grammars, or it may be statements made by previous authors in the field. == CADS as a specific type of corpus-based discourse analysis == Researchers in Italy have developed CADS as a specific type of corpus-based discourse analysis, creating a standard set of methods: 'A basic, standard methodology in CADS may resemble the following:' Step 1: Decide upon the research question; Step 2: Choose, compile or edit an appropriate corpus; Step 3: Choose, compile or edit an appropriate reference corpus / corpora; Step 4: Make frequency lists and run a keywords comparison of the corpora; Step 5: Determine the existence of sets of key items; Step 6: Concordance interesting key items (with differing quantities of co-text); Step 7: (Possibly) refine the research question and return to Step 2. This basic procedure can of course vary according to individual research circumstances and requirements. A particular way of conceptualising research questions has also been proposed in such CADS projects: Given that P is a discourse participant (or possibly an institution) and G is a goal, often a political goal: How does P achieve G with language? What does this tell us about P? Comparative studies: how do P1 and P2 differ in their use of language? Does this tell us anything about their different principles and objectives? A second general type of CADS research question, which might be asked of interactive discourse data, has been conceptualised as follows: Given that P(x) is a particular participant or set of participants, DT is the discourse type, and R is an observed relationship between or among participants: How do {P(a), P(b)...P(n)} achieve / maintain R in DT [using language]? Another common type of research question has been conceptualised thus: Given that A is an author, Ph(x) is a phenomenon or practice or behaviour, and DT(x) is a particular discourse type. A has said P

AI Resume Builders Reviews: What Actually Works in 2026

Shopping for the best AI resume builder? An AI resume builder is software that uses machine learning to help you get more done — it keeps getting smarter as the underlying models improve. Pricing, accuracy, and the size of the model behind the tool are the three factors that most affect daily usefulness. Whether you are a beginner or a pro, the right AI resume builder slots into your workflow and pays for itself fast. We tested the leading options and ranked them by quality, value, and ease of use.

ACLU Mobile Justice

ACLU Mobile Justice was a video live streaming application developed for smartphones by various state chapters of the American Civil Liberties Union. It was intended to allow instant, secure video recording and transmission of interactions with, and perceived abuses by, law enforcement officers. Since its release by the ACLU of California for California residents, other versions of the app have been released for 16 other states and the District of Columbia by their ACLU chapters. It was discontinued in February 2025.

Jared Kaplan

Jared Daniel Kaplan is a theoretical physicist and artificial intelligence researcher. He is an associate professor in the Johns Hopkins University Department of Physics & Astronomy, and a co-founder and chief science officer of Anthropic. == Education == Kaplan attended the Illinois Mathematics and Science Academy during high school. He received a bachelor's degree in physics and mathematics from Stanford University and a PhD in physics from Harvard University. His doctoral thesis is titled Aspects of holography, advised by Nima Arkani-Hamed. == Academic career and physics research == Kaplan’s research interests include quantum gravity, holography (AdS/CFT), conformal field theory, and related topics in particle physics and cosmology. He worked as a postdoctoral fellow at SLAC and Stanford University and has been a professor at Johns Hopkins University since 2012. == Machine learning research == Kaplan joined OpenAI in 2019 as a researcher, where he co-authored Scaling Laws for Neural Language Models (2020), which reported that empirically, the performance of language models steadily improves with their size and the amount of data and compute used for training. He is also a co-author of Language Models are Few-Shot Learners (2020), which introduced GPT-3. At the company, he was also involved in the development of Codex. == Anthropic == Kaplan co-founded Anthropic and serves as its chief science officer. In October 2024, Anthropic announced that Kaplan would serve as the company's "Responsible Scaling Officer", overseeing its responsible scaling policy (RSP). In this role, Kaplan determines the safety assessments and precautions to adopt before model release. In December 2025, The Guardian published an interview with Kaplan about AI autonomy and recursive self-improvement timelines. == Honors and recognition == Kaplan was a Hertz Fellow (2005). He has also received a Sloan Research Fellowship and an NSF CAREER award (PHY-1454083). == Selected works == Scaling Laws for Neural Language Models (2020). Language Models are Few-Shot Learners (2020). A Natural Language for AdS/CFT Correlators (2011). == Personal life == As of 2026, Forbes estimated Kaplan's net worth at $3.7 billion. He lives in Pacifica, California, and has a son.