Curse of dimensionality

Curse of dimensionality

The curse of dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces that do not occur in low-dimensional settings such as the three-dimensional physical space of everyday experience. The expression was coined by Richard E. Bellman when considering problems in dynamic programming. The curse generally refers to issues that arise when the number of datapoints is small (in a suitably defined sense) relative to the intrinsic dimension of the data. Dimensionally cursed phenomena occur in domains such as numerical analysis, sampling, combinatorics, machine learning, data mining and databases. The common theme of these problems is that when the dimensionality increases, the volume of the space increases so fast that the available data becomes sparse. In order to obtain a reliable result, the amount of data needed often grows exponentially with the dimensionality. Also, organizing and searching data often relies on detecting areas where objects form groups with similar properties; in high dimensional data, however, all objects appear to be sparse and dissimilar in many ways, which prevents common data organization strategies from being efficient. == Domains == === Combinatorics === In some problems, each variable can take one of several discrete values, or the range of possible values is divided to give a finite number of possibilities. Taking the variables together, a huge number of combinations of values must be considered. This effect is also known as the combinatorial explosion. Even in the simplest case of d {\displaystyle d} binary variables, the number of possible combinations already is 2 d {\displaystyle 2^{d}} , exponential in the dimensionality. Naively, each additional dimension doubles the effort needed to try all combinations. === Sampling === There is an exponential increase in volume associated with adding extra dimensions to a mathematical space. For example, 102 = 100 evenly spaced sample points suffice to sample a unit interval (try to visualize a "1-dimensional" cube, i.e. a line) with no more than 10−2 = 0.01 distance between points; an equivalent sampling of a 10-dimensional unit hypercube with a lattice that has a spacing of 10−2 = 0.01 between adjacent points would require 1020 = [(102)10] sample points. In general, with a spacing distance of 10−n the 10-dimensional hypercube appears to be a factor of 10n(10−1) = [(10n)10/(10n)] "larger" than the 1-dimensional hypercube, which is the unit interval. In the above example n = 2: when using a sampling distance of 0.01 the 10-dimensional hypercube appears to be 1018 "larger" than the unit interval. This effect is a combination of the combinatorics problems above and the distance function problems explained below. === Optimization === When solving dynamic optimization problems by numerical backward induction, the objective function must be computed for each combination of values. This is a significant obstacle when the dimension of the "state variable" is large. === Machine learning === In machine learning problems that involve learning a "state-of-nature" from a finite number of data samples in a high-dimensional feature space with each feature having a range of possible values, typically an enormous amount of training data is required to ensure that there are several samples with each combination of values. In an abstract sense, as the number of features or dimensions grows, the amount of data we need to generalize accurately grows exponentially. A typical rule of thumb is that there should be at least 5 training examples for each dimension in the representation. In machine learning and insofar as predictive performance is concerned, the curse of dimensionality is used interchangeably with the peaking phenomenon, which is also known as Hughes phenomenon. This phenomenon states that with a fixed number of training samples, the average (expected) predictive power of a classifier or regressor first increases as the number of dimensions or features used is increased but beyond a certain dimensionality it starts deteriorating instead of improving steadily. Nevertheless, in the context of a simple classifier (e.g., linear discriminant analysis in the multivariate Gaussian model under the assumption of a common known covariance matrix), Zollanvari et al. showed both analytically and empirically that as long as the relative cumulative efficacy of an additional feature set (with respect to features that are already part of the classifier) is greater (or less) than the size of this additional feature set, the expected error of the classifier constructed using these additional features will be less (or greater) than the expected error of the classifier constructed without them. In other words, both the size of additional features and their (relative) cumulative discriminatory effect are important in observing a decrease or increase in the average predictive power. In metric learning, higher dimensions can sometimes allow a model to achieve better performance. After normalizing embeddings to the surface of a hypersphere, FaceNet achieves the best performance using 128 dimensions as opposed to 64, 256, or 512 dimensions in one ablation study. A loss function for unitary-invariant dissimilarity between word embeddings was found to be minimized in high dimensions. === Data mining === In data mining, the curse of dimensionality refers to a data set with too many features. Consider the first table, which depicts 200 individuals and 2000 genes (features) with a 1 or 0 denoting whether or not they have a genetic mutation in that gene. A data mining application to this data set may be finding the correlation between specific genetic mutations and creating a classification algorithm such as a decision tree to determine whether an individual has cancer or not. A common practice of data mining in this domain would be to create association rules between genetic mutations that lead to the development of cancers. To do this, one would have to loop through each genetic mutation of each individual and find other genetic mutations that occur over a desired threshold and create pairs. They would start with pairs of two, then three, then four until they result in an empty set of pairs. The complexity of this algorithm can lead to calculating all permutations of gene pairs for each individual or row. Given the formula for calculating the permutations of n items with a group size of r is: n ! ( n − r ) ! {\displaystyle {\frac {n!}{(n-r)!}}} , calculating the number of three pair permutations of any given individual would be 7988004000 different pairs of genes to evaluate for each individual. The number of pairs created will grow by an order of factorial as the size of the pairs increase. The growth is depicted in the permutation table (see right). As we can see from the permutation table above, one of the major problems data miners face regarding the curse of dimensionality is that the space of possible parameter values grows exponentially or factorially as the number of features in the data set grows. This problem critically affects both computational time and space when searching for associations or optimal features to consider. Another problem data miners may face when dealing with too many features is that the number of false predictions or classifications tends to increase as the number of features grows in the data set. In terms of the classification problem discussed above, keeping every data point could lead to a higher number of false positives and false negatives in the model. This may seem counterintuitive, but consider the genetic mutation table from above, depicting all genetic mutations for each individual. Each genetic mutation, whether they correlate with cancer or not, will have some input or weight in the model that guides the decision-making process of the algorithm. There may be mutations that are outliers or ones that dominate the overall distribution of genetic mutations when in fact they do not correlate with cancer. These features may be working against one's model, making it more difficult to obtain optimal results. This problem is up to the data miner to solve, and there is no universal solution. The first step any data miner should take is to explore the data, in an attempt to gain an understanding of how it can be used to solve the problem. One must first understand what the data means, and what they are trying to discover before they can decide if anything must be removed from the data set. Then they can create or use a feature selection or dimensionality reduction algorithm to remove samples or features from the data set if they deem it necessary. One example of such methods is the interquartile range method, used to remove outliers in a data set by calculating the standard deviation of a feature or occurrence. === Distance function === When a measure such as a Euclidean distance is defined using many coordinat

Adobe Encore

Adobe Encore (previously Adobe Encore DVD) was a DVD authoring software tool produced by Adobe Systems and targeted at professional video producers. Video and audio resources could be used in their current format for development, allowing the user to transcode them to MPEG-2 video and Dolby Digital audio upon project completion. DVD menus could be created and edited in Adobe Photoshop using special layering techniques. Adobe Encore did not support writing to a Blu-ray Disc using AVCHD 2.0. Encore is bundled with Adobe Premiere Pro CS6. Adobe Encore CS6 was the last release. While Premiere Pro CC has moved to the Creative Cloud, Encore has now been discontinued. == Licensing == All forms of Adobe Encore used a proprietary licensing system from its developer, Adobe Systems. Versions 1.0 and 1.5 required a separate license fee (rather than making 1.5 available as a free update). Version 3, also known as CS3, was sold only in bundle with Premiere CS3. Encore CS4, CS5, CS5.5 and CS6 were only sold in the Premiere Pro CS4, CS5, CS5.5 and CS6 bundles, respectively. Adobe CC subscribers no longer have access to Adobe Encore CS6. Adobe Encore is not included with Premiere Pro CC. == Functionality == Adobe Encore allowed for creating interactive DVD menus from Photoshop documents, which could be tweaked from within Encore. Video and audio streams could be embedded in the DVD and be made to play when certain elements of the menu are interacted with. It had similar functionality to Adobe Flash and Premiere Pro, due to its ability to both edit video on a timeline and embed interactive content.

Networked Help Desk

Networked Help Desk is an open standard initiative to provide a common API for sharing customer support tickets between separate instances of issue tracking, bug tracking, customer relationship management (CRM) and project management systems to improve customer service and reduce vendor lock-in. The initiative was created by Zendesk in June 2011 in collaboration with eight other founding member organizations including Atlassian, New Relic, OTRS, Pivotal Tracker, ServiceNow and SugarCRM. The first integration, between Zendesk and Atlassian's issue tracking product, Jira, was announced at the 2011 Atlassian Summit. By August 2011, 34 member companies had joined the initiative. A year after launching, over 50 organizations had joined. Within Zendesk instances this feature is branded as ticket sharing. == Basis == Support tools are generally built around a common paradigm that begins with a customer making a request or an incident report, these create a ticket. Each ticket has a progress status and is updated with annotations and attachments. These annotations and attachments may be visible to the customer (public), or only visible to analysts (private). Customers are notified of progress made on their ticket until it is complete. If the people necessary to complete a ticket are using separate support tools, additional overhead is introduced in maintaining the relevant information in the ticket in each tool while notifying the customer of progress made by each group in completing their ticket. For example, if a customer support issue is caused by a software bug and reported to a help desk using one system, and then the fix is documented by the developers in another, and analyzed in a customer relationship management tool, keeping the records in each system up-to-date and notifying the customer manually using a swivel chair approach is unnecessarily time-consuming and error-prone. If information is not transferred correctly, a customer may have to re-explain their problem each time their ticket is transferred. For systems with the Networked Help Desk API implemented, it is possible for several different applications related to a customer's support experience to synchronize data in one uniquely identified shared ticket. While many applications in these domains have implemented APIs that allow data to be imported, exported and modified, Network Help Desk provide a common standard for customer support information to automatically synchronize between several systems. Once implemented, two systems can quickly share tickets with just a configuration change as they both understand the same interface. Communication between two instances on a specific ticket occurs in three steps, an invitation agreement, sharing of ticket data and continued synchronization of tickets. The standard allows for "full delegation" (analysts in both systems each make public and private comments and synchronize status) as well as "partial delegation" where the instance receiving the ticket can only make private comments and status changes are not synchronized. Tickets may be shared with multiple instances. == Implementation list ==

Viaweb

Viaweb was a web-based application that allowed users to build and host their own online stores with little technical expertise using a web browser. The company was started in July 1995 by Paul Graham, Robert Morris (using the pseudonym "John McArtyem"), and Trevor Blackwell. Graham claims Viaweb was the first application service provider. Viaweb was also unusual for being partially written in the Lisp programming language. The software was originally called Webgen, but another company was using the same name, so the company renamed it to Viaweb, "because it worked via the Web". In 1998, Yahoo! Inc. bought Viaweb for 455,000 shares of Yahoo! capital stock, valued at about $49 million, and renamed it Yahoo! Store. Viaweb's example has been influential in Silicon Valley's entrepreneurial culture, largely due to Graham's widely read essays and his subsequent career as a successful venture capitalist.

Carrier cloud

In cloud computing, a carrier cloud is a class of cloud that integrates wide area networks (WAN) and other attributes of communications service providers’ carrier-grade networks to enable the deployment of highly-complex applications in the cloud. In contrast, classic cloud computing focuses on the data center and does not address the network connecting data centers and cloud users. This may result in unpredictable response times and security issues when business-critical data are transferred over the Internet. == History == The advent of virtualization technology, cost-effective computing hardware, and ubiquitous Internet connectivity have enabled the first wave of cloud services starting in the early years of the 21st century. But many businesses and other organizations hesitated to move to more demanding applications, from on-premises dedicated hardware to private or public clouds. As a response, communications service providers started in the 2010/2011 time frame to develop carrier clouds that address perceived weaknesses in existing cloud services. Cited weaknesses vary but often include possible downtime, security issues, high cost of custom software and data transfer, inflexibility of some cloud apps, poor customer and nonfulfillment of service level agreements (SLAs). == Characteristics == To enable the deployment of time-sensitive and business critical applications in the cloud, the carrier cloud is designed to match or even exceed the characteristics of on-premises deployments. Therefore, the carrier cloud is characterized by some or all of the following items: Configurable, elastic network performance: Typical cloud computing solutions use the best effort of the public Internet to connect cloud users and data centers. This approach provides instant connectivity but does not offer control over network capacities, latencies, and jitter. Carrier clouds address these gaps with content delivery networks and/or dedicated virtual private networks (VPN) at OSI layers 1 (optical wavelengths), 2 (data link layer), and 3 (network layer). These VPNs can be configured to offer the desired performance parameters and exhibit the same type of elasticity for the network that regular clouds provide for servers and storage. To achieve the requested performance parameters, such as low latency, cloud applications can be (automatically) allocated to distributed data centers that are close enough to the cloud users. Automatic resource placement: For a cloud with multiple data centers, information about both the data center and the connecting network is relevant for a decision of where to place cloud images and storage volumes. For this decision, carrier clouds can obtain relevant information about the network, e.g., using the Application-Layer Traffic Optimization (ALTO) protocol. High level of security and governance: Cloud application providers are subject to general and domain specific security, privacy, and governance requirements and regulations, such as the European Data Protection Directive and the U.S. Health Insurance Portability and Accountability Act. For added security, the wide area network of the carrier cloud can provide segregated encrypted or unencrypted network links that are not accessible from the general Internet. At the data center, the carrier cloud provides e.g. virtual private servers, management processes, logs, and documentation to fulfill security and governance rules. Location control: Fundamentally, cloud users should not be concerned with the geographic location of their cloud resources. However, privacy and other regulations may mandate that certain types of data must not be sent outside a national jurisdiction or other geographical region. Open APIs: Carrier clouds provide graphical user interfaces and Web application programming interfaces that allow cloud application providers to set up, manage, and monitor both, the data center and the WAN, of their cloud services. == Architecture == Carrier clouds encompass data centers at different network tiers and wide area networks that connect multiple data centers to each other as well as to the cloud users. Links between data centers are used for failover, overflow, backup, and geographic diversity. Carrier clouds can be set up as public, private, or hybrid clouds. The carrier cloud federates these cloud entities by using a single management system to orchestrate, manage, and monitor data center and network resources as a single system.

Computational photography

Computational photography refers to digital image capture and processing techniques that use digital computation instead of optical processes. Computational photography can improve the capabilities of a camera, or introduce features that were not possible at all with film-based photography, or reduce the cost or size of camera elements. Examples of computational photography include in-camera computation of digital panoramas, high-dynamic-range images, and light field cameras. Light field cameras use novel optical elements to capture three-dimensional scene information, which can then be used to produce 3D images, enhanced depth-of-field, and selective de-focusing (or "post focus"). Enhanced depth-of-field reduces the need for mechanical focusing systems. All of these features use computational imaging techniques. The definition of computational photography has evolved to cover a number of subject areas in computer graphics, computer vision, and applied optics. These areas are given below, organized according to a taxonomy proposed by Shree K. Nayar. Within each area is a list of techniques, and for each technique, one or two representative papers or books are cited. Deliberately omitted from the taxonomy are image processing (see also digital image processing) techniques applied to traditionally captured images to produce better images. Examples of such techniques are image scaling, dynamic range compression (i.e. tone mapping), color management, image completion (a.k.a. inpainting or hole filling), image compression, digital watermarking, and artistic image effects. Also omitted are techniques that produce range data, volume data, 3D models, 4D light fields, 4D, 6D, or 8D BRDFs, or other high-dimensional image-based representations. Epsilon photography is a sub-field of computational photography. == Effect on photography == Photos taken using computational photography can allow amateurs to produce photographs rivalling the quality of professional photographers, but as of 2019 do not outperform the use of professional-level equipment. == Computational illumination == This is controlling photographic illumination in a structured fashion, then processing the captured images, to create new images. The applications include image-based relighting, image enhancement, image deblurring, geometry/material recovery and so forth. High-dynamic-range imaging uses differently exposed pictures of the same scene to extend dynamic range. Other examples include processing and merging differently illuminated images of the same subject matter ("lightspace"). == Computational optics == This is a capture of optically coded images, followed by computational decoding to produce new images. Coded aperture imaging was mainly applied in astronomy and X-ray imaging to boost the image quality. Instead of a single pin-hole, a pinhole pattern is applied in imaging, and deconvolution is performed to recover the image. In coded exposure imaging, the on/off state of the shutter is coded to modify the kernel of motion blur. In this way, motion deblurring becomes a well-conditioned problem. Similarly, in a lens based coded aperture, the aperture can be modified by inserting a broadband mask. Thus, out of focus deblurring becomes a well-conditioned problem. The coded aperture can also improve the quality in light field acquisition using Hadamard transform optics. Coded aperture patterns can also be designed using color filters, in order to apply different codes at different wavelengths. This allows for increase the amount of light that reaches the camera sensor, compared to binary masks. == Computational imaging == Computational imaging is a set of imaging techniques that combine data acquisition and data processing to create the image of an object through indirect means to yield enhanced resolution, additional information such as optical phase or 3D reconstruction. The information is often recorded without using a conventional optical microscope configuration or with limited datasets. Computational imaging allows going beyond physical limitations of optical systems, such as numerical aperture, or even obliterates the need for optical elements. For parts of the optical spectrum where imaging elements such as objectives are difficult to manufacture or image sensors cannot be miniaturized, computational imaging provides useful alternatives, in fields such as X-ray and THz radiations. === Common techniques === Among common computational imaging techniques are lensless imaging, computational speckle imaging , ptychography and Fourier ptychography. Computational imaging technique often draws on compressive sensing or phase retrieval techniques, where the angular spectrum of the object is reconstructed. Other techniques are related to the field of computational imaging, such as digital holography, computer vision and inverse problems such as tomography. == Computational processing == This is the processing of non-optically-coded images to produce new images. == Computational sensors == These are detectors that combine sensing and processing, typically in hardware, like the oversampled binary image sensor. == Early work in computer vision == Although computational photography is a currently popular buzzword in computer graphics, many of its techniques first appeared in the computer vision literature, either under other names or within papers aimed at 3D shape analysis. == Art history == Computational photography, as an art form, has been practiced by capturing differently exposed pictures of the same subject matter and combining them. This was the inspiration for the development of the wearable computer in the 1970s and early 1980s. Computational photography was inspired by the work of Charles Wyckoff, and thus computational photography datasets (e.g. differently exposed pictures of the same subject matter that are taken in order to make a single composite image) are sometimes referred to as Wyckoff Sets, in his honor. Early work in this area (joint estimation of image projection and exposure value) was undertaken by Mann and Candoccia. Charles Wyckoff devoted much of his life to creating special kinds of 3-layer photographic films that captured different exposures of the same subject matter. A picture of a nuclear explosion, taken on Wyckoff's film, appeared on the cover of Life Magazine and showed the dynamic range from the dark outer areas to the inner core.

AirPair

AirPair is a service and eponymous company that connects people who need help with programming issues (usually, programmers at small technology companies or at finance companies that use technology products) and people who can help them. Unlike services such as oDesk and Elance, AirPair is not a service for outsourcing programming tasks, but rather a service that facilitates one-off knowledge transfers from people with highly specialized knowledge of particular technology stacks or programming issues to people who are in need of specialized help. == History == AirPair launched in March 2013, with founder Jonathon Kresner, who hails from Australia, working full-time, and it soon hired three other part-time developers to work alongside him. Kresner had previously founded two other startups: Preparty, a social invitation and event-booking service based in Australia, and ClimbFind, an online rock-climbing community that reached a million users. Kresner was inspired to work on AirPair because he saw the need for outside expert assistance with programming issues arise regularly at these startups. In November 2013, founder Kresner describes the company's initial success at bootstrapping itself to "Ramen profitability" in a blog post. In December 2013, AirPair was accepted into the Winter 2014 Y Combinator batch. In March 2014, AirPair announced it would launch partnerships with Stripe, Twilio, and other companies that had their own application programming interfaces, allowing developers having trouble with the APIs to seek help over AirPair from experts on the APIs. AirPair presented at the Y Combinator Winter 2014 Demo Day on March 25, 2014, and successfully raised over $1 million within the next 48 hours. == Reception == A review of AirPair by Will Lam stressed that because payment was based on time rather than results, it was important to use it for clearly thought-out questions where one had high confidence that the session would help. Dennis Beatty, who met AirPair founder Jonathon Kresner in March 2014, wrote in April 2014 a glowing review of AirPair's vision of connecting people and its business success. AirPair has been compared with other peer-to-peer coding help sites such as Codementor and HackHands.