In data management, a content-oriented workflow model seeks to articulate workflow progression by the presence of content units (like data-records/objects/documents). Most content-oriented workflow approaches provide a life-cycle model for content units, such that workflow progression can be qualified by conditions on the state of the units. Most approaches are research and work in progress and the content models and life-cycle models are more or less formalized. The term content-oriented workflows is an umbrella term for several scientific workflow approaches, namely "data-driven", "resource-driven", "artifact-centric", "object-aware", and "document-oriented". Thus, the meaning of "content" ranges from simple data attributes to self-contained documents; the term "content-oriented workflows" appeared at first in as an umbrella term. Such a general term, independent from a specific approach, is necessary to contrast the content-oriented modelling principle with traditional activity-oriented workflow models (like Petri nets or BPMN) where a workflow is driven by a control flow and where the content production perspective is neglected or even missing. The term "content" was chosen to subsume the different levels in granularity of the content units in the respective workflow models; it was also chosen to make associations with content management. Both terms "artifact-centric" and "data-driven" would also be good candidates for an umbrella term, but each is closely related to a specific approach of a single working group. The "artifact-centric" group itself (i.e. IBM Research) has generalized the characteristics of their approach and has used "information-centric" as an umbrella term in. Yet, the term information is too unspecific in the context of computer science, thus, "content-orientated workflows" is considered as good compromise. == Workflow Model Approaches == === Data-driven === The data-driven process structures provides a sophisticated workflow model being specialized on hierarchical write-and-review-processes. The approach provides interleaved synchronization of sub-processes and extends activity diagrams. Unfortunately, the COREPRO prototype implementation is not publicly available. Research on the project had been ceased. The general idea has been continued by Reichert in form of the #Object-aware approach. Synonyms data-driven process structures / data-driven modeling and coordination Protagonists Dr. Dominic Müller (University of Twente), Joachim Herbst (DaimlerChrysler Research), and Manfred Reichert (at this time Assoc. Prof. at Univ. of Twente, currently Prof. at Ulm Univ.) Organization(s) University of Twente, DaimlerChrysler Period 2005 - 2007 Selected publications Implementation COREPRO === Resource-driven === The resource-driven workflow system is an early approach that considered workflows from a content-oriented perspective and emphasizes on the missing support for plain document-driven processes by traditional activity-oriented workflow engines. The resource-driven approach demonstrated the application of database triggers for handling workflow events. Still the system implementation is centralized and the workflow schema is statically defined. The project appeared in 2005 but many aspects are considered future work by the authors. Research did not continue on the project. Wang completed his PhD thesis in 2009, yet, his thesis does not mention the resource-driven approach to workflow modelling but is about discrete event simulation. Synonyms Resource-based Workflows / Document-Driven Workflow Systems Protagonists Jianrui Wang and Prof. Akhil Kumar Organization Pennsylvania State University Period 2005 - today Selected publications Implementation N/A === Artifact-centric === The artifact-centric approach provides a framework for content-oriented workflows. In this model, the enterprise application landscape includes distributed business services, while the workflow engine is centralized. Process enactment is integrated with database management system infrastructure, and the project is funded by IBM. Synonyms artifact-centric business process models / artifact-based business process (ACP) / artifact-centric workflows Protagonists Richard Hull and Dr. Kamal Bhattacharya as well as Cagdas E. Gerede and Jianwen Su Organization IBM (T.J. Watson Research Center, NY) Period 2007 - today Selected publications Implementation ArtiFact === Object-aware === The object-aware approach manages a set of object types and generates forms for creating object instances. The form completion flow is controlled by transitions between object configurations each describing a progressing set of mandatory attributes. Each object configuration is named by an object state. The data production flow is user-shifting and it is discrete by defining a sequence of object states. The discussion is currently limited to a centralized system, without any workflows across different organizations. However, the approach is of great relevance to many domains like concurrent engineering. Finally, the object-aware approach and its PHILharmonicFlows system are going to provide general-purpose workflow systems for generic enactment of data production processes. Synonyms object-aware process management / datenorientiertes Prozess-Management-System Protagonists Vera Künzle and Prof. Manfred Reichert Organization Ulm University Period 2009 - today Selected publications Implementation PHILharmonicFlows === Distributed Document-oriented === Distributed document-oriented process management (dDPM) enables distributed case handling in heterogeneous system environments and it is based on document-oriented integration. The workflow model reflects the paper-based working practice in inter-institutional healthcare scenarios. It targets distributed knowledge-driven ad hoc workflows, wherein distributed information systems are required to coordinate work with initially unknown sets of actors and activities. The distributed workflow engine supports process planning & process history as well as participant management and process template creation with import/export. The workflow engine embeds a functional fusion of 1) group-based instant messaging 2) with a shared work list editor 3) with version control. The software implementation of dDPM is α-Flow which is available as open source. dDPM and α-Flow provide a content-oriented approach to schema-less workflows. The complete distributed case handling application is provided in form of a single active Document ("α-Doc"). The α-Doc is a case file (as information carrier) with an embedded workflow engine (in form of active properties). Inviting process participants is equivalent to providing them with a copy of an α-Doc, copying it like an ordinary desktop file. All α-Docs that belong to the same case can synchronize each other, based on the participant management, electronic postboxes, store-and-forward messaging, and an offline-capable synchronization protocol. Synonyms distributed document-oriented process management (dDPM), distributed case handling via active documents Protagonists Christoph P. Neumann and Prof. Richard Lenz Organization Friedrich-Alexander-Universität Erlangen-Nürnberg Period 2009 - 2012 Selected Publications and a PhD thesis Implementation α-Flow (open source) == Related Concepts == === Content Management === The bandwidth of Content management systems (CMS) reaches from Web content management systems (WCMS) and Document management system (DMS) to Enterprise Content Management (ECM). Mature DMS products support document production workflows in a basic form, primarily focusing on review cycle workflows concerning a single document. === Groupware and Computer-Supported Cooperative Work === Groupware focuses on messaging (like E-Mail, Chat, and Instant Messaging), shared calendars (e.g. Lotus Notes, Microsoft Outlook with Exchange Server), and conferencing (e.g. Skype). Groupware overlaps with Computer-supported cooperative work (CSCW), that originated from shared multimedia editors (for live drawing/sketching) and synchronous multi-user applications like desktop sharing. The extensive conceptual claim of CSWC must be put into perspective by its actual solution scope, that is available as the CSCW Matrix. === Case Handling === The case handling paradigm stems from Prof. van der Aalst and gained momentum in 2005. The core features are: (a) provide all information available, i.e. present the case as a whole rather than showing bits and pieces, (b) decide about activities on the basis of the information available rather than the activities already executed, (c) separate work distribution from authorization and allow for additional types of roles, not just the execute role, and (d) allow workers to view and add/modify data before or after the corresponding activities have been executed. In healthcare, the flow of a patient between healthcare professionals is considered as a workflow - with activities that inc
Personal cloud
A personal cloud is a collection of digital content and services that are accessible from any device through the Internet. It is not a tangible entity, but a place that gives users the ability to store, synchronize, stream and share content on a relative core, moving from one platform, screen and location to another. Created on connected services and applications, it reflects and sets consumer expectations for how next-generation computing services will work. The four primary types of personal cloud in use today are: Online cloud, NAS device cloud, server device cloud, and home-made clouds. == Online cloud == The online cloud is sometimes referred to as the public cloud. It is the cloud computing model where online resources like software and data storage are made available over the Internet. Typically, an individual or organization has little control over the ecosystem in which the online cloud is hosted, and the core infrastructure is shared between many individuals and organizations. The data and applications provided by the service provider are logically segregated so that only those authorized are allowed access. == NAS device cloud == A network-attached storage (NAS) device is a computer connected to a network that provides only file-based data storage services to other devices on the network. Although it may technically be possible to run other software on a NAS device, it is not designed to be a general purpose server. Cloud NAS is remote storage that is accessed over the Internet as if it were local. A cloud NAS is often used for backups and archiving. One of the benefits of NAS Cloud is that data in the cloud can be accessed at any time from anywhere. The main drawback, however, is that the speed of the transfer rate is only as fast as the network connection the data is accessed over and can therefore be fairly slow. == Server device cloud == In many ways cloud servers work in the same way as physical servers but the functions they perform can be very different. Typically, the cloud server is an on-premises device that is connected to the Internet and gives users the functions available on the online cloud but with the added benefit and security of the files being in their control on their premises. The server cloud has been historically enterprise-based deployed by businesses needing an in-house cloud. However, there are also in-house options available for individual users. == Home-made clouds == For the more technologically proficient user a common solution for using a personal cloud is to create a home-made cloud system by connecting an external USB hard drive to a Wi-Fi router. This enables both wired and wireless computers to access the USB hard drive and use it for storage or for retrieving files a user needs to share on the network thereby acting like a cloud. Setting up a personal cloud requires a user to have particular skills in technology and network setup. One of the risks associated with improper setup is security, and leaving the files accessible to anyone with technical knowledge. Not every router supports this type of access and modification.
Optical neural network
An optical neural network is a physical implementation of an artificial neural network with optical components. Early optical neural networks used a photorefractive Volume hologram to interconnect arrays of input neurons to arrays of output with synaptic weights in proportion to the multiplexed hologram's strength. Volume holograms were further multiplexed using spectral hole burning to add one dimension of wavelength to space to achieve four dimensional interconnects of two dimensional arrays of neural inputs and outputs. This research led to extensive research on alternative methods using the strength of the optical interconnect for implementing neuronal communications. Some artificial neural networks that have been implemented as optical neural networks include the Hopfield neural network and the Kohonen self-organizing map with liquid crystal spatial light modulators Optical neural networks can also be based on the principles of neuromorphic engineering, creating neuromorphic photonic systems. Typically, these systems encode information in the networks using spikes, mimicking the functionality of spiking neural networks in optical and photonic hardware. Photonic devices that have demonstrated neuromorphic functionalities include (among others) vertical-cavity surface-emitting lasers, integrated photonic modulators, optoelectronic systems based on superconducting Josephson junctions or systems based on resonant tunnelling diodes. == Electrochemical vs. optical neural networks == Biological neural networks function on an electrochemical basis, while optical neural networks use electromagnetic waves. Optical interfaces to biological neural networks can be created with optogenetics, but is not the same as an optical neural networks. In biological neural networks there exist a lot of different mechanisms for dynamically changing the state of the neurons, these include short-term and long-term synaptic plasticity. Synaptic plasticity is among the electrophysiological phenomena used to control the efficiency of synaptic transmission, long-term for learning and memory, and short-term for short transient changes in synaptic transmission efficiency. Implementing this with optical components is difficult, and ideally requires advanced photonic materials. Properties that might be desirable in photonic materials for optical neural networks include the ability to change their efficiency of transmitting light, based on the intensity of incoming light. == Rising Era of Optical Neural Networks == With the increasing significance of computer vision in various domains, the computational cost of these tasks has increased, making it more important to develop the new approaches of the processing acceleration. Optical computing has emerged as a potential alternative to GPU acceleration for modern neural networks, particularly considering the looming obsolescence of Moore's Law. Consequently, optical neural networks have garnered increased attention in the research community. Presently, two primary methods of optical neural computing are under research: silicon photonics-based and free-space optics. Each approach has its benefits and drawbacks; while silicon photonics may offer superior speed, it lacks the massive parallelism that free-space optics can deliver. Given the substantial parallelism capabilities of free-space optics, researchers have focused on taking advantage of it. One implementation, proposed by Lin et al., involves the training and fabrication of phase masks for a handwritten digit classifier. By stacking 3D-printed phase masks, light passing through the fabricated network can be read by a photodetector array of ten detectors, each representing a digit class ranging from 1 to 10. Although this network can achieve terahertz-range classification, it lacks flexibility, as the phase masks are fabricated for a specific task and cannot be retrained. An alternative method for classification in free-space optics, introduced by Cahng et al., employs a 4F system that is based on the convolution theorem to perform convolution operations. This system uses two lenses to execute the Fourier transforms of the convolution operation, enabling passive conversion into the Fourier domain without power consumption or latency. However, the convolution operation kernels in this implementation are also fabricated phase masks, limiting the device's functionality to specific convolutional layers of the network only. In contrast, Li et al. proposed a technique involving kernel tiling to use the parallelism of the 4F system while using a Digital Micromirror Device (DMD) instead of a phase mask. This approach allows users to upload various kernels into the 4F system and execute the entire network's inference on a single device. Unfortunately, modern neural networks are not designed for the 4F systems, as they were primarily developed during the CPU/GPU era. Mostly because they tend to use a lower resolution and a high number of channels in their feature maps. == Other Implementations == In 2007 there was one model of Optical Neural Network: the Programmable Optical Array/Analogic Computer (POAC). It had been implemented in the year 2000 and reported based on modified Joint Fourier Transform Correlator (JTC) and Bacteriorhodopsin (BR) as a holographic optical memory. Full parallelism, large array size and the speed of light are three promises offered by POAC to implement an optical CNN. They had been investigated during the last years with their practical limitations and considerations yielding the design of the first portable POAC version. The practical details – hardware (optical setups) and software (optical templates) – were published. However, POAC is a general purpose and programmable array computer that has a wide range of applications including: image processing pattern recognition target tracking real-time video processing document security optical switching == Progress in the 2020s == Taichi from Tsinghua University in Beijing is a hybrid ONN that combines the power efficiency and parallelism of optical diffraction and the configurability of optical interference. Taichi offers 13.96 million parameters. Taichi avoids the high error rates that afflict deep (multi-layer) networks by combining clusters of fewer-layer diffractive units with arrays of interferometers for reconfigurable computation. Its encoding protocol divides large network models into sub-models that can be distributed across multiple chiplets in parallel. Taichi achieved 91.89% accuracy in tests with the Omniglot database. It was also used to generate music Bach and generate images the styles of Van Gogh and Munch. The developers claimed energy efficiency of up to 160 trillion operations second−1 watt−1 and an area efficiency of 880 trillion multiply-accumulate operations mm−2 or 103 more energy efficient than the NVIDIA H100, and 102 times more energy efficient and 10 times more area efficient than previous ONNs. Time dimension has recently been introduced into diffractive neural network by fs laser lithography of perovskite hydration. The temporal behaviour of the neuron can be modulated by the fs laser at the nanoscale, enabling a programmable holographic neural network with temporal evolution functionality, i.e., the functionality can change with time under the hydration stimuli. An in-memory temporal inference functionality was demonstrated to mimic the function evolution of the human brain, i.e., the functionality can change from simple digit image classification to more complicated digit and clothing product image classification with time. This is the first time of introducing time dimension into the optical neural network, laying a foundation for future brain-like photonic chip development.
List of datasets for machine-learning research
These datasets are used in machine learning (ML) research and have been cited in peer-reviewed academic journals. Datasets are an integral part of the field of machine learning. Major advances in this field can result from advances in learning algorithms (such as deep learning), computer hardware, and, less intuitively, the availability of high-quality training datasets. High-quality labeled training datasets for supervised and semi-supervised machine-learning algorithms are usually difficult and expensive to produce because of the large amount of time needed to label the data. Although they do not need to be labeled, high-quality unlabeled datasets for unsupervised learning can also be difficult and costly to produce. Many organizations, including governments, publish and share their datasets, often using common metadata formats (such as Croissant). The datasets are classified, based on the licenses, into two groups: open data and non-open data. The datasets from various governmental-bodies are presented in List of open government data sites. The datasets are ported on open data portals. They are made available for searching, depositing and accessing through interfaces like Open API. The datasets are made available as various sorted types and subtypes. == List of sorting used for datasets == The data portal is classified based on its type of license. The open source license based data portals are known as open data portals which are used by many government organizations and academic institutions. == List of open data portals == == List of portals suitable for multiple types of applications == The data portal sometimes lists a wide variety of subtypes of datasets pertaining to many machine learning applications. == List of portals suitable for a specific subtype of applications == The data portals which are suitable for a specific subtype of machine learning application are listed in the subsequent sections. == Image data == == Text data == These datasets consist primarily of text for tasks such as natural language processing, sentiment analysis, translation, and cluster analysis. === Reviews === === News articles === === Messages === === Twitter and tweets === === Dialogues === === Legal === === Other text === == Sound data == These datasets consist of sounds and sound features used for tasks such as speech recognition and speech synthesis. === Speech === === Music === === Other sounds === == Signal data == Datasets containing electric signal information requiring some sort of signal processing for further analysis. === Electrical === === Motion-tracking === === Other signals === == Chemical data == Datasets from physical systems. === Chemical Reactions with transition states (TS) === === OpenReACT-CHON-EFH === OpenReACT-CHON-EFH (Open Reaction Dataset of Atomic ConfiguraTions comprising C, H, O and N with Energies, Forces and Hessians) is a 2025 open-access benchmark for machine-learning interatomic potentials. RTP set – 35,087 stationary-point geometries (reactant, transition state and product) drawn from 11,961 elementary reactions, each labeled with density-functional energies, atomic forces and full Hessian matrices at the ωB97X-D/6-31G(d) level. IRC set – 34,248 structures along 600 minimum-energy reaction paths, used to test extrapolation beyond trained stationary points. NMS set – 62,527 off-equilibrium geometries generated by normal-mode sampling to probe model robustness under thermal perturbations. The collection underpins the study Does Hessian Data Improve the Performance of Machine Learning Potentials? and was used to train and benchmark the machine-learning interatomic potentials reported therein. The dataset itself is distributed under a CC licence via Figshare. == Physical data == Datasets from physical systems. === High-energy physics === === Systems === === Astronomy === === Earth science === === Other physical === == Biological data == Datasets from biological systems. === Human === === Animal === === Fungi === === Plant === === Microbe === === Drug discovery === == Anomaly data == == Question answering data == This section includes datasets that deals with structured data. == Dialog or instruction prompted data == This section includes datasets that contains multi-turn text with at least two actors, a "user" and an "agent". The user makes requests for the agent, which performs the request. == Cybersecurity == == Climate and sustainability == == Code data == == Multivariate data == === Financial === === Weather === === Census === === Transit === === Internet === === Games === === Other multivariate === == Curated repositories of datasets == As datasets come in myriad formats and can sometimes be difficult to use, there has been considerable work put into curating and standardizing the format of datasets to make them easier to use for machine learning research. OpenML: Web platform with Python, R, Java, and other APIs for downloading hundreds of machine learning datasets, evaluating algorithms on datasets, and benchmarking algorithm performance against dozens of other algorithms. PMLB: A large, curated repository of benchmark datasets for evaluating supervised machine learning algorithms. Provides classification and regression datasets in a standardized format that are accessible through a Python API. Metatext NLP: https://metatext.io/datasets web repository maintained by community, containing nearly 1000 benchmark datasets, and counting. Provides many tasks from classification to QA, and various languages from English, Portuguese to Arabic. Appen: Off The Shelf and Open Source Datasets hosted and maintained by the company. These biological, image, physical, question answering, signal, sound, text, and video resources number over 250 and can be applied to over 25 different use cases.
Vapnik–Chervonenkis theory
Vapnik–Chervonenkis theory (also known as VC theory) was developed during 1960–1990 by Vladimir Vapnik and Alexey Chervonenkis. The theory is a form of computational learning theory, which attempts to explain the learning process from a statistical point of view. == Introduction == VC theory covers at least four parts (as explained in The Nature of Statistical Learning Theory): Theory of consistency of learning processes What are (necessary and sufficient) conditions for consistency of a learning process based on the empirical risk minimization principle? Nonasymptotic theory of the rate of convergence of learning processes How fast is the rate of convergence of the learning process? Theory of controlling the generalization ability of learning processes How can one control the rate of convergence (the generalization ability) of the learning process? Theory of constructing learning machines How can one construct algorithms that can control the generalization ability? VC Theory is a major subbranch of statistical learning theory. One of its main applications in statistical learning theory is to provide generalization conditions for learning algorithms. From this point of view, VC theory is related to stability, which is an alternative approach for characterizing generalization. In addition, VC theory and VC dimension are instrumental in the theory of empirical processes, in the case of processes indexed by VC classes. Arguably these are the most important applications of the VC theory, and are employed in proving generalization. Several techniques will be introduced that are widely used in the empirical process and VC theory. The discussion is mainly based on the book Weak Convergence and Empirical Processes: With Applications to Statistics. == Overview of VC theory in empirical processes == === Background on empirical processes === Let ( X , A ) {\displaystyle ({\mathcal {X}},{\mathcal {A}})} be a measurable space. For any measure Q {\displaystyle Q} on ( X , A ) {\displaystyle ({\mathcal {X}},{\mathcal {A}})} , and any measurable functions f : X → R {\displaystyle f:{\mathcal {X}}\to \mathbf {R} } , define Q f = ∫ f d Q {\displaystyle Qf=\int fdQ} Measurability issues will be ignored here, for more technical detail see. Let F {\displaystyle {\mathcal {F}}} be a class of measurable functions f : X → R {\displaystyle f:{\mathcal {X}}\to \mathbf {R} } and define: ‖ Q ‖ F = sup { | Q f | : f ∈ F } . {\displaystyle \|Q\|_{\mathcal {F}}=\sup\{\vert Qf\vert \ :\ f\in {\mathcal {F}}\}.} Let X 1 , … , X n {\displaystyle X_{1},\ldots ,X_{n}} be independent, identically distributed random elements of ( X , A ) {\displaystyle ({\mathcal {X}},{\mathcal {A}})} . Then define the empirical measure P n = n − 1 ∑ i = 1 n δ X i , {\displaystyle \mathbb {P} _{n}=n^{-1}\sum _{i=1}^{n}\delta _{X_{i}},} where δ here stands for the Dirac measure. The empirical measure induces a map F → R {\displaystyle {\mathcal {F}}\to \mathbf {R} } given by: f ↦ P n f = 1 n ( f ( X 1 ) + . . . + f ( X n ) ) {\displaystyle f\mapsto \mathbb {P} _{n}f={\frac {1}{n}}(f(X_{1})+...+f(X_{n}))} Now suppose P is the underlying true distribution of the data, which is unknown. Empirical Processes theory aims at identifying classes F {\displaystyle {\mathcal {F}}} for which statements such as the following hold: uniform law of large numbers: ‖ P n − P ‖ F → n 0 , {\displaystyle \|\mathbb {P} _{n}-P\|_{\mathcal {F}}{\underset {n}{\to }}0,} That is, as n → ∞ {\displaystyle n\to \infty } , | 1 n ( f ( X 1 ) + . . . + f ( X n ) ) − ∫ f d P | → 0 {\displaystyle \left|{\frac {1}{n}}(f(X_{1})+...+f(X_{n}))-\int fdP\right|\to 0} uniformly for all f ∈ F {\displaystyle f\in {\mathcal {F}}} . uniform central limit theorem: G n = n ( P n − P ) ⇝ G , in ℓ ∞ ( F ) {\displaystyle \mathbb {G} _{n}={\sqrt {n}}(\mathbb {P} _{n}-P)\rightsquigarrow \mathbb {G} ,\quad {\text{in }}\ell ^{\infty }({\mathcal {F}})} In the former case F {\displaystyle {\mathcal {F}}} is called Glivenko–Cantelli class, and in the latter case (under the assumption ∀ x , sup f ∈ F | f ( x ) − P f | < ∞ {\displaystyle \forall x,\sup \nolimits _{f\in {\mathcal {F}}}\vert f(x)-Pf\vert <\infty } ) the class F {\displaystyle {\mathcal {F}}} is called Donsker or P-Donsker. A Donsker class is Glivenko–Cantelli in probability by an application of Slutsky's theorem. These statements are true for a single f {\displaystyle f} , by standard LLN, CLT arguments under regularity conditions, and the difficulty in the Empirical Processes comes in because joint statements are being made for all f ∈ F {\displaystyle f\in {\mathcal {F}}} . Intuitively then, the set F {\displaystyle {\mathcal {F}}} cannot be too large, and as it turns out that the geometry of F {\displaystyle {\mathcal {F}}} plays a very important role. One way of measuring how big the function set F {\displaystyle {\mathcal {F}}} is to use the so-called covering numbers. The covering number N ( ε , F , ‖ ⋅ ‖ ) {\displaystyle N(\varepsilon ,{\mathcal {F}},\|\cdot \|)} is the minimal number of balls { g : ‖ g − f ‖ < ε } {\displaystyle \{g:\|g-f\|<\varepsilon \}} needed to cover the set F {\displaystyle {\mathcal {F}}} (here it is obviously assumed that there is an underlying norm on F {\displaystyle {\mathcal {F}}} ). The entropy is the logarithm of the covering number. Two sufficient conditions are provided below, under which it can be proved that the set F {\displaystyle {\mathcal {F}}} is Glivenko–Cantelli or Donsker. A class F {\displaystyle {\mathcal {F}}} is P-Glivenko–Cantelli if it is P-measurable with envelope F such that P ∗ F < ∞ {\displaystyle P^{\ast }F<\infty } and satisfies: ∀ ε > 0 sup Q N ( ε ‖ F ‖ Q , F , L 1 ( Q ) ) < ∞ . {\displaystyle \forall \varepsilon >0\quad \sup \nolimits _{Q}N(\varepsilon \|F\|_{Q},{\mathcal {F}},L_{1}(Q))<\infty .} The next condition is a version of Dudley's theorem. If F {\displaystyle {\mathcal {F}}} is a class of functions such that ∫ 0 ∞ sup Q log N ( ε ‖ F ‖ Q , 2 , F , L 2 ( Q ) ) d ε < ∞ {\displaystyle \int _{0}^{\infty }\sup \nolimits _{Q}{\sqrt {\log N\left(\varepsilon \|F\|_{Q,2},{\mathcal {F}},L_{2}(Q)\right)}}d\varepsilon <\infty } then F {\displaystyle {\mathcal {F}}} is P-Donsker for every probability measure P such that P ∗ F 2 < ∞ {\displaystyle P^{\ast }F^{2}<\infty } . In the last integral, the notation means ‖ f ‖ Q , 2 = ( ∫ | f | 2 d Q ) 1 2 {\displaystyle \|f\|_{Q,2}=\left(\int |f|^{2}dQ\right)^{\frac {1}{2}}} . === Symmetrization === The majority of the arguments about how to bound the empirical process rely on symmetrization, maximal and concentration inequalities, and chaining. Symmetrization is usually the first step of the proofs, and since it is used in many machine learning proofs on bounding empirical loss functions (including the proof of the VC inequality which is discussed in the next section). It is presented here: Consider the empirical process: f ↦ ( P n − P ) f = 1 n ∑ i = 1 n ( f ( X i ) − P f ) {\displaystyle f\mapsto (\mathbb {P} _{n}-P)f={\dfrac {1}{n}}\sum _{i=1}^{n}(f(X_{i})-Pf)} Turns out that there is a connection between the empirical and the following symmetrized process: f ↦ P n 0 f = 1 n ∑ i = 1 n ε i f ( X i ) {\displaystyle f\mapsto \mathbb {P} _{n}^{0}f={\dfrac {1}{n}}\sum _{i=1}^{n}\varepsilon _{i}f(X_{i})} The symmetrized process is a Rademacher process, conditionally on the data X i {\displaystyle X_{i}} . Therefore, it is a sub-Gaussian process by Hoeffding's inequality. Lemma (Symmetrization). For every nondecreasing, convex Φ: R → R and class of measurable functions F {\displaystyle {\mathcal {F}}} , E Φ ( ‖ P n − P ‖ F ) ≤ E Φ ( 2 ‖ P n 0 ‖ F ) {\displaystyle \mathbb {E} \Phi (\|\mathbb {P} _{n}-P\|_{\mathcal {F}})\leq \mathbb {E} \Phi \left(2\left\|\mathbb {P} _{n}^{0}\right\|_{\mathcal {F}}\right)} The proof of the Symmetrization lemma relies on introducing independent copies of the original variables X i {\displaystyle X_{i}} (sometimes referred to as a ghost sample) and replacing the inner expectation of the LHS by these copies. After an application of Jensen's inequality different signs could be introduced (hence the name symmetrization) without changing the expectation. The proof can be found below because of its instructive nature. The same proof method can be used to prove the Glivenko–Cantelli theorem. A typical way of proving empirical CLTs, first uses symmetrization to pass the empirical process to P n 0 {\displaystyle \mathbb {P} _{n}^{0}} and then argue conditionally on the data, using the fact that Rademacher processes are simple processes with nice properties. === VC Connection === It turns out that there is a fascinating connection between certain combinatorial properties of the set F {\displaystyle {\mathcal {F}}} and the entropy numbers. Uniform covering numbers can be controlled by the notion of Vapnik–Chervonenkis classes of sets – or shortly VC sets. Consider a collection C {\displaystyle {\mathcal {C}}} of subsets of the sample space X {\displaystyle
Reconstruction from projections
The problem of reconstructing a multidimensional signal from its projection is uniquely multidimensional, having no 1-D counterpart. It has applications that range from computer-aided tomography to geophysical signal processing. It is a problem which can be explored from several points of view—as a deconvolution problem, a modeling problem, an estimation problem, or an interpolation problem. == Motivation and applications == Many fields in science and engineering use reconstruction from projections, especially in imaging. It is widely applied geophysical tomography, medical imaging and industrial radiography. For example, in a CT scanner, the 3D structure of the patient’s body being scanned is measured with beams going through the tissue and hitting a detector, giving a flat projection of the body from that angle. Multiple projections are put together to get an image of the position and shape of structures inside in 3D. == Problem statement and basics == A projection is a linear mapping of an M {\displaystyle M} dimensional signal into an N {\displaystyle N} dimensional one, where N ≤ M {\displaystyle N\leq M} . And the objective of reconstruction is to restore the M {\displaystyle M} dimensional signal based on the N {\displaystyle N} dimensional signal. The following case is a 2-D signal projected into 1D signal. The signal in the original coordinate is denoted as d ( u , v ) {\displaystyle d(u,v)} . Now consider a collimated beam of radiation coming from the opposite orientation of v ^ {\displaystyle {\hat {v}}} , producing a projection along u ^ {\displaystyle {\hat {u}}} . v ^ {\displaystyle {\hat {v}}} and u ^ {\displaystyle {\hat {u}}} are normal to each other, and the angle between u {\displaystyle u} and u ^ {\displaystyle {\hat {u}}} is theta. The signal obtained along u ^ {\displaystyle {\hat {u}}} axis is defined to be p θ ( u ^ ) {\displaystyle p_{\theta }({\hat {u}})} . The relationship between the original coordinate and the rotated coordinate is given by [ u ^ v ^ ] = [ cos θ sin θ − sin θ cos θ ] [ u v ] {\displaystyle {\begin{bmatrix}{\hat {u}}\\{\hat {v}}\end{bmatrix}}={\begin{bmatrix}\cos \theta &\sin \theta \\-\sin \theta &\cos \theta \end{bmatrix}}{\begin{bmatrix}u\\v\end{bmatrix}}} or inversely, [ u v ] = [ cos θ − sin θ sin θ cos θ ] [ u ^ v ^ ] {\displaystyle {\begin{bmatrix}u\\v\end{bmatrix}}={\begin{bmatrix}\cos \theta &-\sin \theta \\\sin \theta &\cos \theta \end{bmatrix}}{\begin{bmatrix}{\hat {u}}\\{\hat {v}}\end{bmatrix}}} Then we have p θ ( u ^ ) = ∫ − ∞ ∞ d ( u , v ) d v ^ = ∫ − ∞ ∞ d ( u ^ cos ( θ ) − v ^ sin ( θ ) , u ^ sin ( θ ) + v ^ cos ( θ ) ) d v ^ {\displaystyle p_{\theta }({\hat {u}})=\int _{-\infty }^{\infty }d(u,v)\,\mathrm {d} {\hat {v}}=\int _{-\infty }^{\infty }d({\hat {u}}\cos(\theta )-{\hat {v}}\sin(\theta ),{\hat {u}}\sin(\theta )+{\hat {v}}\cos(\theta ))\,\mathrm {d} {\hat {v}}} By varying theta, a large number of projections can be obtained. Given the projection-slice theorem, D ( Ω , θ ) {\displaystyle D(\Omega ,\theta )} ,the slice of the Fourier transform of d ( u , v ) {\displaystyle d(u,v)} at angle theta, is equivalent to P θ ( Ω ) {\displaystyle P_{\theta }(\Omega )} , the Fourier Transform of the projection p θ ( u ^ ) {\displaystyle p_{\theta }({\hat {u}})} . Therefore, the unknown d ( u , v ) {\displaystyle d(u,v)} can be obtained from its Fourier transform by means of the Fourier transform inversion integral d ( u , v ) = 1 4 π 2 ∫ − ∞ ∞ ∫ − ∞ ∞ D ( Ω 1 , Ω 2 ) e j Ω 1 u e j Ω 2 v d Ω 1 , Ω 2 {\displaystyle \mathrm {d} (u,v)={\frac {1}{4\pi ^{2}}}\int _{-\infty }^{\infty }\int _{-\infty }^{\infty }D(\Omega _{1},\Omega _{2})e^{j\Omega _{1}u}e^{j\Omega _{2}v}\,\mathrm {d} \Omega _{1},\Omega _{2}} = 1 4 π 2 ∫ 0 ∞ ∫ − π π D ( Ω , θ ) e j Ω u cos ( θ ) e j Ω v s i n θ | Ω | d Ω d θ {\displaystyle ={\frac {1}{4\pi ^{2}}}\int _{0}^{\infty }\int _{-\pi }^{\pi }D(\Omega ,\theta )e^{j\Omega u\cos(\theta )}e^{j\Omega vsin\theta }{\begin{vmatrix}\Omega \end{vmatrix}}\,\mathrm {d} \Omega \mathrm {d} \theta } = 1 4 π 2 ∫ − π π ∫ 0 ∞ P θ ( Ω ) e j Ω ( u cos θ + v sin θ ) | Ω | d Ω d θ {\displaystyle ={\frac {1}{4\pi ^{2}}}\int _{-\pi }^{\pi }\int _{0}^{\infty }P_{\theta }(\Omega )e^{j}\Omega (u\cos \theta +v\sin \theta ){\begin{vmatrix}\Omega \end{vmatrix}}\,\mathrm {d} \Omega \mathrm {d} \theta } = 1 4 π 2 ∫ 0 π ( ∫ − ∞ ∞ P θ ( Ω ) | Ω | {\displaystyle ={\frac {1}{4\pi ^{2}}}\int _{0}^{\pi }(\int _{-\infty }^{\infty }P_{\theta }(\Omega ){\begin{vmatrix}\Omega \end{vmatrix}}} e j Ω u ^ d Ω ) d θ {\displaystyle e^{j\Omega {\hat {u}}}\mathrm {d} \Omega )\mathrm {d} \theta } By taking the inverse Fourier Transform and assuming g ( u ^ ) = F − 1 ( | Ω | 2 ) {\displaystyle g({\hat {u}})={\mathcal {F}}^{-1}({{\begin{vmatrix}\Omega \end{vmatrix}}^{2}})} , we get d ( u , v ) = ∑ i △ θ i [ p θ ( u ^ ) ∗ g θ i ( u ^ ) ] {\displaystyle d(u,v)=\sum _{i}\vartriangle \theta _{i}[p_{\theta }({\hat {u}})g_{\theta i}({\hat {u}})]} == Approaches == In practice, there are a wide variety of methods that are utilized, most of which are reconstruct 3-D information (volume) from 2-D signals (image). Typically used methods are CT, MRI, PET and SPECT. And the filtered back projection based on the principles introduced above are commonly applied. === Computed Tomography (CT) === In CT, a volume is formed by stacking the axial slices. The software cuts the volume in a different plane (usually orthogonal). Commonly, slice data is generated using an X-ray source that rotates around the object. X-ray sensors are positioned on the opposite side of the circle from the X-ray source. === Magnetic resonance imaging (MRI) === In MRI, energy from an oscillating magnetic field is temporarily applied to the patient at the appropriate resonance frequency. The protons (hydrogen atoms) emit a radio frequency signal which is measured by a receiving coil. The radio signal can be made to encode position information by varying the main magnetic field using gradient coils. === Positron emission tomography (PET) === The system detects pairs of gamma rays emitted indirectly by a positron-emitting radionuclide (tracer), which is introduced into the body on a biologically active molecule. Three-dimensional images of tracer concentration within the body are then constructed by computer analysis. In modern PET-CT scanners, three dimensional imaging is often accomplished with the aid of a CT X-ray scan performed on the patient during the same session, in the same machine. === Single-photon emission computed tomography (SPECT) === SPECT imaging is performed by using a gamma camera to acquire multiple 2-D images (projections) from multiple angles. Multiple projections are used to yield a 3-D data set. This data set may then be manipulated to show thin slices along any chosen axis of the body. SPECT is similar to PET in its use of radioactive tracer material and detection of gamma rays, while the tracers used in SPECT emit gamma radiation that is measured more directly.
XGBoost
XGBoost (eXtreme Gradient Boosting) is an open-source software library which provides a regularizing gradient boosting framework for C++, Java, Python, R, Julia, Perl, and Scala. It works on Linux, Microsoft Windows, and macOS. From the project description, it aims to provide a "Scalable, Portable and Distributed Gradient Boosting (GBM, GBRT, GBDT) Library". It runs on a single machine, as well as the distributed processing frameworks Apache Hadoop, Apache Spark, Apache Flink, and Dask. XGBoost gained much popularity and attention in the mid-2010s as the algorithm of choice for many winning teams of machine learning competitions. == History == XGBoost initially started as a research project by Tianqi Chen as part of the Distributed (Deep) Machine Learning Community (DMLC) group at the University of Washington. Initially, it began as a terminal application which could be configured using a libsvm configuration file. It became well known in the ML competition circles after its use in the winning solution of the Higgs Machine Learning Challenge. Soon after, the Python and R packages were built, and XGBoost now has package implementations for Java, Scala, Julia, Perl, and other languages. This brought the library to more developers and contributed to its popularity among the Kaggle community, where it has been used for a large number of competitions. It was soon integrated with a number of other packages making it easier to use in their respective communities. It has now been integrated with scikit-learn for Python users and with the caret package for R users. It can also be integrated into Data Flow frameworks like Apache Spark, Apache Hadoop, and Apache Flink using the abstracted Rabit and XGBoost4J. XGBoost is also available on OpenCL for FPGAs. An efficient, scalable implementation of XGBoost has been published by Tianqi Chen and Carlos Guestrin. While the XGBoost model often achieves higher accuracy than a single decision tree, it sacrifices the intrinsic interpretability of decision trees. For example, following the path that a decision tree takes to make its decision is trivial and self-explained, but following the paths of hundreds or thousands of trees is much harder. == Features == Salient features of XGBoost which make it different from other gradient boosting algorithms include: Clever penalization of trees A proportional shrinking of leaf nodes Newton Boosting Extra randomization parameter Implementation on single, distributed systems and out-of-core computation Automatic feature selection Theoretically justified weighted quantile sketching for efficient computation Parallel tree structure boosting with sparsity Efficient cacheable block structure for decision tree training == The algorithm == XGBoost works as Newton–Raphson in function space unlike gradient boosting that works as gradient descent in function space, a second order Taylor approximation is used in the loss function to make the connection to Newton–Raphson method. A generic unregularized XGBoost algorithm is: == Parameters == XGBoost has parameters which can be specified to affect how it functions and performs. Some parameters include: Learning rate (also known as the "step size" or “"shrinkage"), it is a number between 0 and 1 (default is 0.3), which determines the rate the algorithm learns from each iteration. n_estimators, sets the number of trees to be built in the ensemble, where more trees generally increases the complexity of the model, but can lead to overfitting with too many trees. Gamma (also known as Lagrange multiplier or the minimum loss reduction parameter), controls the minimum amount of loss reduction required to make a further split on a leaf node of the tree. The default in XGBoost is 0 . max_depth, represents how deeply each tree in the boosting process can grow during training, where the default is 6. == Awards == John Chambers Award (2016) High Energy Physics meets Machine Learning award (HEP meets ML) (2016)