AI Chat Free No Limit

AI Chat Free No Limit — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Algorithm selection

    Algorithm selection

    Algorithm selection (sometimes also called per-instance algorithm selection or offline algorithm selection) is a meta-algorithmic technique to choose an algorithm from a portfolio on an instance-by-instance basis. It is motivated by the observation that on many practical problems, different algorithms have different performance characteristics. That is, while one algorithm performs well in some scenarios, it performs poorly in others and vice versa for another algorithm. If we can identify when to use which algorithm, we can optimize for each scenario and improve overall performance. This is what algorithm selection aims to do. The only prerequisite for applying algorithm selection techniques is that there exists (or that there can be constructed) a set of complementary algorithms. == Definition == Given a portfolio P {\displaystyle {\mathcal {P}}} of algorithms A ∈ P {\displaystyle {\mathcal {A}}\in {\mathcal {P}}} , a set of instances i ∈ I {\displaystyle i\in {\mathcal {I}}} and a cost metric m : P × I → R {\displaystyle m:{\mathcal {P}}\times {\mathcal {I}}\to \mathbb {R} } , the algorithm selection problem consists of finding a mapping s : I → P {\displaystyle s:{\mathcal {I}}\to {\mathcal {P}}} from instances I {\displaystyle {\mathcal {I}}} to algorithms P {\displaystyle {\mathcal {P}}} such that the cost ∑ i ∈ I m ( s ( i ) , i ) {\displaystyle \sum _{i\in {\mathcal {I}}}m(s(i),i)} across all instances is optimized. == Examples == === Boolean satisfiability problem (and other hard combinatorial problems) === A well-known application of algorithm selection is the Boolean satisfiability problem. Here, the portfolio of algorithms is a set of (complementary) SAT solvers, the instances are Boolean formulas, the cost metric is for example average runtime or number of unsolved instances. So, the goal is to select a well-performing SAT solver for each individual instance. In the same way, algorithm selection can be applied to many other N P {\displaystyle {\mathcal {NP}}} -hard problems (such as mixed integer programming, CSP, AI planning, TSP, MAXSAT, QBF and answer set programming). Competition-winning systems in SAT are SATzilla, 3S and CSHC === Machine learning === In machine learning, algorithm selection is better known as meta-learning. The portfolio of algorithms consists of machine learning algorithms (e.g., Random Forest, SVM, DNN), the instances are data sets and the cost metric is for example the error rate. So, the goal is to predict which machine learning algorithm will have a small error on each data set. == Instance features == The algorithm selection problem is mainly solved with machine learning techniques. By representing the problem instances by numerical features f {\displaystyle f} , algorithm selection can be seen as a multi-class classification problem by learning a mapping f i ↦ A {\displaystyle f_{i}\mapsto {\mathcal {A}}} for a given instance i {\displaystyle i} . Instance features are numerical representations of instances. For example, we can count the number of variables, clauses, average clause length for Boolean formulas, or number of samples, features, class balance for ML data sets to get an impression about their characteristics. === Static vs. probing features === We distinguish between two kinds of features: Static features are in most cases some counts and statistics (e.g., clauses-to-variables ratio in SAT). These features ranges from very cheap features (e.g. number of variables) to very complex features (e.g., statistics about variable-clause graphs). Probing features (sometimes also called landmarking features) are computed by running some analysis of algorithm behavior on an instance (e.g., accuracy of a cheap decision tree algorithm on an ML data set, or running for a short time a stochastic local search solver on a Boolean formula). These feature often cost more than simple static features. === Feature costs === Depending on the used performance metric m {\displaystyle m} , feature computation can be associated with costs. For example, if we use running time as performance metric, we include the time to compute our instance features into the performance of an algorithm selection system. SAT solving is a concrete example, where such feature costs cannot be neglected, since instance features for CNF formulas can be either very cheap (e.g., to get the number of variables can be done in constant time for CNFs in the DIMACs format) or very expensive (e.g., graph features which can cost tens or hundreds of seconds). It is important to take the overhead of feature computation into account in practice in such scenarios; otherwise a misleading impression of the performance of the algorithm selection approach is created. For example, if the decision which algorithm to choose can be made with perfect accuracy, but the features are the running time of the portfolio algorithms, there is no benefit to the portfolio approach. This would not be obvious if feature costs were omitted. == Approaches == === Regression approach === One of the first successful algorithm selection approaches predicted the performance of each algorithm m ^ A : I → R {\displaystyle {\hat {m}}_{\mathcal {A}}:{\mathcal {I}}\to \mathbb {R} } and selected the algorithm with the best predicted performance a r g min A ∈ P m ^ A ( i ) {\displaystyle arg\min _{{\mathcal {A}}\in {\mathcal {P}}}{\hat {m}}_{\mathcal {A}}(i)} for an instance i {\displaystyle i} . === Clustering approach === A common assumption is that the given set of instances I {\displaystyle {\mathcal {I}}} can be clustered into homogeneous subsets and for each of these subsets, there is one well-performing algorithm for all instances in there. So, the training consists of identifying the homogeneous clusters via an unsupervised clustering approach and associating an algorithm with each cluster. A new instance is assigned to a cluster and the associated algorithm selected. A more modern approach is cost-sensitive hierarchical clustering using supervised learning to identify the homogeneous instance subsets. === Pairwise cost-sensitive classification approach === A common approach for multi-class classification is to learn pairwise models between every pair of classes (here algorithms) and choose the class that was predicted most often by the pairwise models. We can weight the instances of the pairwise prediction problem by the performance difference between the two algorithms. This is motivated by the fact that we care most about getting predictions with large differences correct, but the penalty for an incorrect prediction is small if there is almost no performance difference. Therefore, each instance i {\displaystyle i} for training a classification model A 1 {\displaystyle {\mathcal {A}}_{1}} vs A 2 {\displaystyle {\mathcal {A}}_{2}} is associated with a cost | m ( A 1 , i ) − m ( A 2 , i ) | {\displaystyle |m({\mathcal {A}}_{1},i)-m({\mathcal {A}}_{2},i)|} . == Requirements == The algorithm selection problem can be effectively applied under the following assumptions: The portfolio P {\displaystyle {\mathcal {P}}} of algorithms is complementary with respect to the instance set I {\displaystyle {\mathcal {I}}} , i.e., there is no single algorithm A ∈ P {\displaystyle {\mathcal {A}}\in {\mathcal {P}}} that dominates the performance of all other algorithms over I {\displaystyle {\mathcal {I}}} (see figures to the right for examples on complementary analysis). In some application, the computation of instance features is associated with a cost. For example, if the cost metric is running time, we have also to consider the time to compute the instance features. In such cases, the cost to compute features should not be larger than the performance gain through algorithm selection. == Application domains == Algorithm selection is not limited to single domains but can be applied to any kind of algorithm if the above requirements are satisfied. Application domains include: hard combinatorial problems: SAT, Mixed Integer Programming, CSP, AI Planning, TSP, MAXSAT, QBF and Answer Set Programming combinatorial auctions in machine learning, the problem is known as meta-learning software design black-box optimization multi-agent systems numerical optimization linear algebra, differential equations evolutionary algorithms vehicle routing problem power systems For an extensive list of literature about algorithm selection, we refer to a literature overview. == Variants of algorithm selection == === Online selection === Online algorithm selection refers to switching between different algorithms during the solving process. This is useful as a hyper-heuristic. In contrast, offline algorithm selection selects an algorithm for a given instance only once and before the solving process. === Computation of schedules === An extension of algorithm selection is the per-instance algorithm scheduling problem, in which we do not select only one solver, but we select a time budget for each algorithm

    Read more →
  • Enterprise bus matrix

    Enterprise bus matrix

    The enterprise bus matrix is a data warehouse planning tool and model created by Ralph Kimball, and is part of the data warehouse bus architecture. The matrix is the logical definition of one of the core concepts of Kimball's approach to dimensional modeling conformed dimension. The bus matrix defines part of the data warehouse bus architecture and is an output of the business requirements phase in the Kimball lifecycle. It is applied in the following phases of dimensional modeling and development of the data warehouse. The matrix can be categorized as a hybrid model, being part technical design tool, part project management tool and part communication tool == Background == The need for an enterprise bus matrix stems from the way one goes about creating the overall data warehouse environment. Historically there have been two approaches: a structured, centralized and planned approach and a more loosely defined, department specific approach, in which solutions are developed in a more independent matter. Autonomous projects can result in a range of isolated stove pipe data marts. Naturally each approach has its issues; the visionary approach often struggles with long delivery cycles and lack of reaction time as needs emerge and scope issues arise. On the other hand, the development of isolated data marts leads to stovepipe systems that lack synergy in development. Over time this approach will lead to a so-called data-mart-in-a-box architecture where interoperability and lack of cohesion is apparent, and can hinder the realization of an overall enterprise data warehouse. As an attempt to handle this issue, Ralph Kimball introduced the enterprise bus. == Description == The bus matrix purpose is one of high abstraction and visionary planning on the data warehouse architectural level. By dictating coherency in the development and implementation of an overall data warehouse the bus architecture approach enables an overall vision of the broader enterprise integration and consistency while at the same time dividing the problem into more manageable parts – all in a technology and software independent manner. The bus matrix and architecture builds upon the concept of conformed dimensions, creating a structure of common dimensions that ideally can be used across the enterprise by all business processes related to the data warehouse and the corresponding fact tables from which they derive their context. According to Kimball and Margy Ross's article “Differences of Opinion” "The Enterprise Data warehouse built on the bus architecture ”identifies and enforces the relationship between business process metrics (facts) and descriptive attributes (dimensions)”. The concept of a bus is well known in the language of information technology, and is what reflects the conformed dimension concept in the data warehouse, creating the skeletal structure where all parts of a system connect, ensuring interoperability and consistency of data, and at the same time considers future expansion. This makes the conformed dimensions act as the integration ‘glue’, creating a robust backbone of the enterprise Data Warehouse.

    Read more →
  • Interviewer effect

    Interviewer effect

    The interviewer effect (also called interviewer variance or interviewer error) is the distortion of response to an interviewer-administered data collection effort which results from differential reactions to the social style and personality of interviewers or to their presentation of particular questions. The use of fixed-wording questions is one method of reducing interviewer bias. Anthropological research and case-studies are also affected by the problem, which is exacerbated by the self-fulfilling prophecy, when the researcher is also the interviewer it is also any effect on data gathered from interviewing people that is caused by the behavior or characteristics (real or perceived) of the interviewer. Interviewer effects can also be associated with the characteristics of the interviewer, such as race. Whether black respondents are interviewed by white interviewers or black interviewers has a strong impact on their responses to both attitude questions and behavioral ones. In the latter case, for example, if black respondents are interviewed by black interviewers in pre-election surveys, they are more likely to actually vote in the upcoming election than if they are interviewed by white interviewers. Furthermore, the race of the interviewer can also affect answers to factual questions that might take the form of a test of how informed the respondent is. Black respondents in a survey of political knowledge, for example, get fewer correct answers to factual questions about politics when interviewed by white interviewers than when interviewed by black interviewers. This is consistent with the research literature on stereotype threat, which finds diminished test performance of potentially stigmatised groups when the interviewer or test supervisor is from a perceived higher status group. Interviewer effects can be mitigated somewhat by randomly assigning subjects to different interviewers, or by using tools such as computer-assisted telephone interviewing (CATI).

    Read more →
  • Non-personal data

    Non-personal data

    Non-Personal Data (NPD) is electronic data that does not contain any information that can be used to identify a natural person. Thus, it can either be data that has no personal information to begin with (such as weather data, stock prices, data from anonymous IoT sensors); or it is data that had personal data that was subsequently pseudoanonymized (for example, identifiable strings substituted with random strings) or anonymized (such as by irreversibly removing all personal data). NPD is part of the overall Data Governance Strategy of a region or country. While personal data are covered by Data Protection Legislation such as GDPR, other kinds of data would fall under the scope of NPD Regulation. == Importance of non-personal data == It has been pointed out that the future is data-driven. What this means is that much of the present innovation taking place in domains such as Machine Learning and Artificial Intelligence is fueled by data, which is needed for calibrating the complex models (comprising neural network-based as well as other kinds). The larger the volume, diversity and quality of the data, the higher is the quality of the model, leading to better predictions and explanations. However, there is a flip-side to data availability. The newly-emerging awareness of privacy and the consequent need for powerful Data Protection Regulations (such as GDPR) makes it increasingly difficult or impossible to obtain data in the quantities required. This is a contradiction, and the only way out would be to remove all personal data from data sets (either by Data anonymization or Pseudonymization coupled with noise injection, at which point it becomes NPD. Therefore, many innovation-friendly countries are coming out with regulatory regimes that would ensure that personal data is protected, while, at the same time, non-personal data can be extracted from personal data so that innovation is fostered. In other words, NPD 'unlocks' value that was locked away in data sets that have personally-identifiable information. It is expected that multiple NPD data sets will begin to be available on free or commercial basis from different providers once the regulations are in place. == Emerging regulatory frameworks == Non-Personal Data has significant uses that may be economic, social, political or security-related. Several countries and regions are in the process of regulating the use of NPD. In May 2019, the European Union operationalized its Regulation of the Free Flow of NPD. India announced a nine-member expert committee to make recommendations on the regulation of NPD in 2019, which published its first report in mid-2020. The report was opened for public comments, after which it was revised and published in December 2020. == Proposed NPD regulatory framework in India == The following were the objectives of the proposed Indian regulation as per the revised report: Sovereignty: India has rights over the data of India, its people and organisations. Benefit India: Benefits of data must accrue to India and its people. Benefits the world: Innovation, new models and algorithms for the world. Privacy: Misuse, reidentification and harms must be prevented. Simplicity: The regulations should be simple, digital and unambiguous. Innovation and entrepreneurship: The data should be freely available for innovation and entrepreneurship in India. == Concerns == The major concern in the use of NPD is if there are techniques (statistical or AI-based) by which multiple data sets can be used to extract personally-identifiable data.

    Read more →
  • Tensor operator

    Tensor operator

    In pure and applied mathematics, quantum mechanics and computer graphics, a tensor operator generalizes the notion of operators which are scalars and vectors. A special class of these are spherical tensor operators which apply the notion of the spherical basis and spherical harmonics. The spherical basis closely relates to the description of angular momentum in quantum mechanics and spherical harmonic functions. The coordinate-free generalization of a tensor operator is known as a representation operator. == The general notion of scalar, vector, and tensor operators == In quantum mechanics, physical observables that are scalars, vectors, and tensors, must be represented by scalar, vector, and tensor operators, respectively. Whether something is a scalar, vector, or tensor depends on how it is viewed by two observers whose coordinate frames are related to each other by a rotation. Alternatively, one may ask how, for a single observer, a physical quantity transforms if the state of the system is rotated. Consider, for example, a system consisting of a molecule of mass M {\displaystyle M} , traveling with a definite center of mass momentum, p z ^ {\displaystyle p{\mathbf {\hat {z}} }} , in the z {\displaystyle z} direction. If we rotate the system by 90 ∘ {\displaystyle 90^{\circ }} about the y {\displaystyle y} axis, the momentum will change to p x ^ {\displaystyle p{\mathbf {\hat {x}} }} , which is in the x {\displaystyle x} direction. The center-of-mass kinetic energy of the molecule will, however, be unchanged at p 2 / 2 M {\displaystyle p^{2}/2M} . The kinetic energy is a scalar and the momentum is a vector, and these two quantities must be represented by a scalar and a vector operator, respectively. By the latter in particular, we mean an operator whose expected values in the initial and the rotated states are p z ^ {\displaystyle p{\mathbf {\hat {z}} }} and p x ^ {\displaystyle p{\mathbf {\hat {x}} }} . The kinetic energy on the other hand must be represented by a scalar operator, whose expected value must be the same in the initial and the rotated states. In the same way, tensor quantities must be represented by tensor operators. An example of a tensor quantity (of rank two) is the electrical quadrupole moment of the above molecule. Likewise, the octupole and hexadecapole moments would be tensors of rank three and four, respectively. Other examples of scalar operators are the total energy operator (more commonly called the Hamiltonian), the potential energy, and the dipole-dipole interaction energy of two atoms. Examples of vector operators are the momentum, the position, the orbital angular momentum, L {\displaystyle {\mathbf {L} }} , and the spin angular momentum, S {\displaystyle {\mathbf {S} }} . (Fine print: Angular momentum is a vector as far as rotations are concerned, but unlike position or momentum it does not change sign under space inversion, and when one wishes to provide this information, it is said to be a pseudovector.) Scalar, vector and tensor operators can also be formed by products of operators. For example, the scalar product L ⋅ S {\displaystyle {\mathbf {L} }\cdot {\mathbf {S} }} of the two vector operators, L {\displaystyle {\mathbf {L} }} and S {\displaystyle {\mathbf {S} }} , is a scalar operator, which figures prominently in discussions of the spin–orbit interaction. Similarly, the quadrupole moment tensor of our example molecule has the nine components Q i j = ∑ α q α ( 3 r α , i r α , j − r α 2 δ i j ) . {\displaystyle Q_{ij}=\sum _{\alpha }q_{\alpha }\left(3r_{\alpha ,i}r_{\alpha ,j}-r_{\alpha }^{2}\delta _{ij}\right).} Here, the indices i {\displaystyle i} and j {\displaystyle j} can independently take on the values 1, 2, and 3 (or x {\displaystyle x} , y {\displaystyle y} , and z {\displaystyle z} ) corresponding to the three Cartesian axes, the index α {\displaystyle \alpha } runs over all particles (electrons and nuclei) in the molecule, q α {\displaystyle q_{\alpha }} is the charge on particle α {\displaystyle \alpha } , and r α , i {\displaystyle r_{\alpha ,i}} is the i {\displaystyle i} -th component of the position of this particle. Each term in the sum is a tensor operator. In particular, the nine products r α , i r α , j {\displaystyle r_{\alpha ,i}r_{\alpha ,j}} together form a second rank tensor, formed by taking the outer product of the vector operator r α {\displaystyle {\mathbf {r} }_{\alpha }} with itself. == Rotations of quantum states == === Quantum rotation operator === The rotation operator about the unit vector n (defining the axis of rotation) through angle θ is U [ R ( θ , n ^ ) ] = exp ⁡ ( − i θ ℏ n ^ ⋅ J ) {\displaystyle U[R(\theta ,{\hat {\mathbf {n} }})]=\exp \left(-{\frac {i\theta }{\hbar }}{\hat {\mathbf {n} }}\cdot \mathbf {J} \right)} where J = (Jx, Jy, Jz) are the rotation generators (also the angular momentum matrices): J x = ℏ 2 ( 0 1 0 1 0 1 0 1 0 ) J y = ℏ 2 ( 0 i 0 − i 0 i 0 − i 0 ) J z = ℏ ( − 1 0 0 0 0 0 0 0 1 ) {\displaystyle J_{x}={\frac {\hbar }{\sqrt {2}}}{\begin{pmatrix}0&1&0\\1&0&1\\0&1&0\end{pmatrix}}\,\quad J_{y}={\frac {\hbar }{\sqrt {2}}}{\begin{pmatrix}0&i&0\\-i&0&i\\0&-i&0\end{pmatrix}}\,\quad J_{z}=\hbar {\begin{pmatrix}-1&0&0\\0&0&0\\0&0&1\end{pmatrix}}} and let R ^ = R ^ ( θ , n ^ ) {\displaystyle {\widehat {R}}={\widehat {R}}(\theta ,{\hat {\mathbf {n} }})} be a rotation matrix. According to the Rodrigues' rotation formula, the rotation operator then amounts to U [ R ( θ , n ^ ) ] = 1 1 − i sin ⁡ θ ℏ n ^ ⋅ J − 1 − cos ⁡ θ ℏ 2 ( n ^ ⋅ J ) 2 . {\displaystyle U[R(\theta ,{\hat {\mathbf {n} }})]=1\!\!1-{\frac {i\sin \theta }{\hbar }}{\hat {\mathbf {n} }}\cdot \mathbf {J} -{\frac {1-\cos \theta }{\hbar ^{2}}}({\hat {\mathbf {n} }}\cdot \mathbf {J} )^{2}.} An operator Ω ^ {\displaystyle {\widehat {\Omega }}} is invariant under a unitary transformation U if Ω ^ = U † Ω ^ U ; {\displaystyle {\widehat {\Omega }}={U}^{\dagger }{\widehat {\Omega }}U;} in this case for the rotation U ^ ( R ) {\displaystyle {\widehat {U}}(R)} , Ω ^ = U ( R ) † Ω ^ U ( R ) = exp ⁡ ( i θ ℏ n ^ ⋅ J ) Ω ^ exp ⁡ ( − i θ ℏ n ^ ⋅ J ) . {\displaystyle {\widehat {\Omega }}={U(R)}^{\dagger }{\widehat {\Omega }}U(R)=\exp \left({\frac {i\theta }{\hbar }}{\hat {\mathbf {n} }}\cdot \mathbf {J} \right){\widehat {\Omega }}\exp \left(-{\frac {i\theta }{\hbar }}{\hat {\mathbf {n} }}\cdot \mathbf {J} \right).} === Angular momentum eigenkets === The orthonormal basis set for total angular momentum is | j , m ⟩ {\displaystyle |j,m\rangle } , where j is the total angular momentum quantum number and m is the magnetic angular momentum quantum number, which takes values −j, −j + 1, ..., j − 1, j. A general state within the j subspace | ψ ⟩ = ∑ m c j m | j , m ⟩ {\displaystyle |\psi \rangle =\sum _{m}c_{jm}|j,m\rangle } rotates to a new state by: | ψ ¯ ⟩ = U ( R ) | ψ ⟩ = ∑ m c j m U ( R ) | j , m ⟩ {\displaystyle |{\bar {\psi }}\rangle =U(R)|\psi \rangle =\sum _{m}c_{jm}U(R)|j,m\rangle } Using the completeness condition: I = ∑ m ′ | j , m ′ ⟩ ⟨ j , m ′ | {\displaystyle I=\sum _{m'}|j,m'\rangle \langle j,m'|} we have | ψ ¯ ⟩ = I U ( R ) | ψ ⟩ = ∑ m m ′ c j m | j , m ′ ⟩ ⟨ j , m ′ | U ( R ) | j , m ⟩ {\displaystyle |{\bar {\psi }}\rangle =IU(R)|\psi \rangle =\sum _{mm'}c_{jm}|j,m'\rangle \langle j,m'|U(R)|j,m\rangle } Introducing the Wigner D matrix elements: D ( R ) m ′ m ( j ) = ⟨ j , m ′ | U ( R ) | j , m ⟩ {\displaystyle {D(R)}_{m'm}^{(j)}=\langle j,m'|U(R)|j,m\rangle } gives the matrix multiplication: | ψ ¯ ⟩ = ∑ m m ′ c j m D m ′ m ( j ) | j , m ′ ⟩ ⇒ | ψ ¯ ⟩ = D ( j ) | ψ ⟩ {\displaystyle |{\bar {\psi }}\rangle =\sum _{mm'}c_{jm}D_{m'm}^{(j)}|j,m'\rangle \quad \Rightarrow \quad |{\bar {\psi }}\rangle =D^{(j)}|\psi \rangle } For one basis ket: | j , m ¯ ⟩ = ∑ m ′ D ( R ) m ′ m ( j ) | j , m ′ ⟩ {\displaystyle |{\overline {j,m}}\rangle =\sum _{m'}{D(R)}_{m'm}^{(j)}|j,m'\rangle } For the case of orbital angular momentum, the eigenstates | ℓ , m ⟩ {\displaystyle |\ell ,m\rangle } of the orbital angular momentum operator L and solutions of Laplace's equation on a 3d sphere are spherical harmonics: Y ℓ m ( θ , ϕ ) = ⟨ θ , ϕ | ℓ , m ⟩ = ( 2 ℓ + 1 ) 4 π ( ℓ − m ) ! ( ℓ + m ) ! P ℓ m ( cos ⁡ θ ) e i m ϕ {\displaystyle Y_{\ell }^{m}(\theta ,\phi )=\langle \theta ,\phi |\ell ,m\rangle ={\sqrt {{(2\ell +1) \over 4\pi }{(\ell -m)! \over (\ell +m)!}}}\,P_{\ell }^{m}(\cos {\theta })\,e^{im\phi }} where Pℓm is an associated Legendre polynomial, ℓ is the orbital angular momentum quantum number, and m is the orbital magnetic quantum number which takes the values −ℓ, −ℓ + 1, ... ℓ − 1, ℓ The formalism of spherical harmonics have wide applications in applied mathematics, and are closely related to the formalism of spherical tensors, as shown below. Spherical harmonics are functions of the polar and azimuthal angles, ϕ and θ respectively, which can be conveniently collected into a unit vector n(θ, ϕ) pointing in the direction of those angles, in the Cartesian basis it is: n ^ ( θ , ϕ ) = cos ⁡ ϕ sin ⁡ θ e x + s

    Read more →
  • Artificial intelligence industry in Canada

    Artificial intelligence industry in Canada

    The artificial intelligence industry in Canada is a rapidly expanding sector. Although Canada held a pioneering role in the early development of artificial intelligence, transforming research excellence into broad commercial adoption has proven challenging. Despite globally recognized scientific achievements and a deep pool of skilled experts, by June 2024, Canada recorded the lowest rate of AI integration among OECD countries, with only 12% of firms implementing AI in their products or services. However, AI adoption has shown significant momentum—doubling from mid-2024 to mid-2025, rising from 6.1% to 12.2%. As of September 2025, Statistics Canada indicated that while about one-third of Canadian businesses had no plans to adopt artificial intelligence in the next year, 14.5% reported intentions to begin using AI for producing goods or delivering services. The primary reasons for not moving forward with AI were lack of relevance, insufficient knowledge, and privacy concerns. According to Public Works Canada (PwC), the pace of AI adoption in Canada is roughly three-quarters of the United States rate, highlighting a notable gap between the two countries in business integration of this technology. British-Canadian computer scientist Geoffrey Hinton stated in 2025 that Canadian companies are adopting artificial intelligence at a slower pace, which may result in the loss of the country's early advantages in the field. At the "All In AI" conference held in Montreal in September 2025, the Minister of Artificial Intelligence and Digital Innovation Evan Solomon, described "Building digital sovereignty" as the most pressing democratic issue of the time. He introduced a 26-person task force focused on updating Canada's AI strategy. In their 2024 report " "Learning Together for Responsible Artificial Intelligence" report, the Innovation, Science, and Economic Development Canada stressed that public awareness, trust, and AI literacy are essential for the responsible adoption and governance of AI in Canada. Montreal workshops in 2021 expanded the OECD's 2019 definition of AI as "the set of computer techniques that enable a machine (e.g., a computer or telephone) to perform tasks that typically require intelligence, such as reasoning or learning. It is also referred to as the automation of intelligent tasks. Scientific developments in AI, such as deep-learning techniques, have made it possible to design access to huge amounts of data and ever-increasing computing power. These new techniques have been rapidly deployed on a large scale in all areas of social life, in transport, education, culture and health." == Federal investments and policy == The 2025 federal budget allocates over $1 billion over the next five years to bolster Canada's artificial intelligence and quantum computing ecosystem. == Industry landscape or research hubs == AlexNet, an influential deep convolutional neural network developed at the University of Toronto by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, marked a pivotal turning point in modern artificial intelligence. In 2012, it achieved a dramatic reduction in error rates for the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), showcasing the practical power of deep learning and GPU acceleration. The success of AlexNet helped cement Canada’s reputation for AI leadership and inspired rapid adoption of deep learning across the technology sector, with ongoing impact in both academic and commercial domains. In healthcare, AlexNet has been adapted for medical imaging to assist with analyzing radiographs, mammograms, and other scans, including identifying abnormalities and supporting clinical diagnosis. In 2015, the Ottawa-based start-up Advanced Symbolics Inc. (ASI) began developing Polly, an artificial intelligence system designed to analyze and anticipate how target audiences behave—enabling more effective communication strategies and advertising campaigns. Polly was named after its first assignment analyzing the politics of Brexit. The AI gained widespread attention in 2016 for accurately forecasting both the Brexit referendum and the 2016 U.S. presidential election won by Donald Trump. The company states that Polly is used by organizations in diverse sectors—including healthcare, politics, entertainment, and mental health research—to support decision-making based on predictive analytics. Chartwatch, an AI tool developed in Canada, has been shown to reduce unexpected hospital deaths by 26% according to a 2024 study. The system analyzes patient data to detect subtle signs of deterioration, supporting healthcare teams in providing timely interventions. === Notable figures in AI in Canada === Geoffrey Hinton's decades-long work eventually formed the foundation of artificial intelligence, which earned him the Nobel Prize for physics in 2024. Yoshua Bengio, who won the Turing Award in 2018 for his pioneering work in deep learning, founded what would become Mila in 1993. Mila, is currently a collaboration between four Montreal-based academic partners.—the Pan-Canadian Artificial Intelligence Strategy includes Alberta's Amii, Toronto's Vector Institute, and Mila. Fakhreddine Karray's work on operational AI has had tangible impact across several Canadian-relevant sectors, notably intelligent transportation systems, virtual healthcare, and driver safety. === AI in the oil and gas industry === According to a 2020 Ernst & Young report the oil and gas industry in Canada is using AI in automating routine, repetitive, and dangerous tasks with technologies like robotic process automation and machine learning; optimizing production and processing; enhancing transportation logistics; improving equipment operation and monitoring; and enabling preventative maintenance. AI is also deployed for data analysis to improve prediction and decision-making, and is expected to automate up to 50% of job competencies in upstream oil and gas by 2040. Oilsands giant Suncor Energy operates a large fleet of autonomous trucks and has started using AI in its dispatch system at the Mildred Lake mine. As of 2024, AI manages routine tasks such as allocating trucks to dump stations and sending them to refuelling locations. === Indigenous and Inuit Innovation in AI === Indigenous organizations have been working on the creation of new technologies for language revitalization in partnership with National Research Council of Canada since the mid-2010s. In 2025, Inuit researchers and technology partners launched an AI-powered initiative to support the revitalization and preservation of Inuktitut, demonstrating how artificial intelligence can be adapted for Indigenous language and cultural priorities. A 2025 CBC article notes that, while AI can help revitalize Inuktitut, Inuit leaders emphasize concerns about data sovereignty, information ownership, and the need for Indigenous leadership to ensure transparency, privacy, and accountability in AI development. == Regulation == Canada's Artificial Intelligence and Data Act (AIDA) was proposed in November 2022, as part of the Digital Charter Implementation Act (Bill C-27). As well voluntary codes, such as the September 2023 Code of Conduct for Generative AI, and landmark investments in advanced computing infrastructure and the Canadian Artificial Intelligence Safety Institute (CAISI) reflect Canada's commitment to both safety and global competitiveness. == AI infrastructure == Canada has undertaken efforts to expand its AI computing infrastructure at both provincial and federal levels. The federal government's Canadian Sovereign AI Compute Strategy, allocated up to C$2 billion in Budget 2024, aims to enhance computing capacity to support domestic AI industry growth and AI adoption across the economy, with up to C$700 million designated to mobilize private sector investment in new or expanded data centres. Alberta has introduced an AI Data Centres Strategy to position itself as a leading North American destination for data centre investment, targeting C$100 billion worth of AI data centres under development by 2030. One major project under Alberta's strategy is the Wonder Valley AI Data Centre Park near Grande Prairie, which was exempted from provincial environmental impact assessment in April 2026 but still requires permits demonstrating safe construction and operation. According to Statista, as of April 2026, Canada has 287 data centres.

    Read more →
  • Reservoir sampling

    Reservoir sampling

    Reservoir sampling is a family of randomized algorithms for choosing a simple random sample, without replacement, of k items from a population of unknown size n in a single pass over the items. The size of the population n is not known to the algorithm and is typically too large for all n items to fit into main memory. The population is revealed to the algorithm over time, and the algorithm cannot look back at previous items. At any point, the current state of the algorithm must permit extraction of a simple random sample without replacement of size k over the part of the population seen so far. == Motivation == Suppose we see a sequence of items, one at a time. We want to keep 10 items in memory, and we want them to be selected at random from the sequence. If we know the total number of items n and can access the items arbitrarily, then the solution is easy: select 10 distinct indices i between 1 and n with equal probability, and keep the i-th elements. The problem is that we do not always know the exact n in advance. == Simple: Algorithm R == A simple and popular but slow algorithm, Algorithm R, was created by Jeffrey Vitter. Initialize an array R {\displaystyle R} indexed from 1 {\displaystyle 1} to k {\displaystyle k} , containing the first k items of the input x 1 , . . . , x k {\displaystyle x_{1},...,x_{k}} . This is the reservoir. For each new input x i {\displaystyle x_{i}} , generate a random number j uniformly in { 1 , . . . , i } {\displaystyle \{1,...,i\}} . If j ∈ { 1 , . . . , k } {\displaystyle j\in \{1,...,k\}} , then set R [ j ] := x i . {\displaystyle R[j]:=x_{i}.} Otherwise, discard x i {\displaystyle x_{i}} . Return R {\displaystyle R} after all inputs are processed. This algorithm works by induction on i ≥ k {\displaystyle i\geq k} . While conceptually simple and easy to understand, this algorithm needs to generate a random number for each item of the input, including the items that are discarded. The algorithm's asymptotic running time is thus O ( n ) {\displaystyle O(n)} . Generating this amount of randomness and the linear run time causes the algorithm to be unnecessarily slow if the input population is large. This is Algorithm R, implemented as follows: == Optimal: Algorithm L == If we generate n {\displaystyle n} random numbers u 1 , . . . , u n ∼ U [ 0 , 1 ] {\displaystyle u_{1},...,u_{n}\sim U[0,1]} independently, then the indices of the smallest k {\displaystyle k} of them is a uniform sample of the k {\displaystyle k} -subsets of { 1 , . . . , n } {\displaystyle \{1,...,n\}} . The process can be done without knowing n {\displaystyle n} : Keep the smallest k {\displaystyle k} of u 1 , . . . , u i {\displaystyle u_{1},...,u_{i}} that has been seen so far, as well as w i {\displaystyle w_{i}} , the index of the largest among them. For each new u i + 1 {\displaystyle u_{i+1}} , compare it with u w i {\displaystyle u_{w_{i}}} . If u i + 1 < u w i {\displaystyle u_{i+1} Read more →

  • Sieve of Eratosthenes

    Sieve of Eratosthenes

    In mathematics, the sieve of Eratosthenes is an ancient algorithm for finding all prime numbers up to any given limit. It does so by iteratively marking as composite (i.e., not prime) the multiples of each prime, starting with the first prime number, 2. The multiples of a given prime are generated as a sequence of numbers starting from that prime, with constant difference between them that is equal to that prime. This is the sieve's key distinction from using trial division to sequentially test each candidate number for divisibility by each prime. Once all the multiples of each discovered prime have been marked as composites, the remaining unmarked numbers are primes. The earliest known reference to the sieve (Ancient Greek: κόσκινον Ἐρατοσθένους, kóskinon Eratosthénous) is in Nicomachus of Gerasa's Introduction to Arithmetic, an early 2nd-century CE book which attributes it to Eratosthenes of Cyrene, a 3rd-century BCE Greek mathematician, though describing the sieving by odd numbers instead of by primes. One of a number of prime number sieves, it is one of the most efficient ways to find all of the smaller primes. It may be used to find primes in arithmetic progressions. == Overview == A prime number is a natural number that has exactly two distinct natural number divisors: the number 1 and itself. To find all the prime numbers less than or equal to a given integer n by Eratosthenes's method: Create a list of consecutive integers from 2 through n: (2, 3, 4, ..., n). Initially, let p equal 2, the smallest prime number. Enumerate the multiples of p by counting in increments of p from 2p to n, and mark them in the list (these will be 2p, 3p, 4p, ...; the p itself should not be marked). Find the smallest number in the list greater than p that is not marked. If there was no such number, stop. Otherwise, let p now equal this new number (which is the next prime), and repeat from step 3. When the algorithm terminates, the numbers remaining not marked in the list are all the primes below n. The main idea here is that every value given to p will be prime, because if it were composite it would be marked as a multiple of some other, smaller prime. Note that some of the numbers may be marked more than once (e.g., 15 will be marked both for 3 and 5). The key property of the sieve is that only additions are needed, no multiplications or divisions are used. As a refinement, it is sufficient to mark the numbers in step 3 starting from p2, as all the smaller multiples of p will have already been marked at that point. This means that the algorithm is allowed to terminate in step 4 when p2 is greater than n. Another refinement is to initially list odd numbers only, (3, 5, ..., n), and count in increments of 2p in step 3, thus marking only odd multiples of p. This actually appears in the original algorithm. This can be generalized with wheel factorization, forming the initial list only from numbers coprime with the first few primes and not just from odds (i.e., numbers coprime with 2), and counting in the correspondingly adjusted increments so that only such multiples of p are generated that are coprime with those small primes, in the first place. === Example === To find all the prime numbers less than or equal to 30, proceed as follows. First, generate a list of natural numbers from 2 to 30: 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 The first number in the list is 2; cross out every 2nd number in the list after 2 by counting up from 2 in increments of 2 (these will be all the multiples of 2 in the list): 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 The next number in the list after 2 is 3; cross out every 3rd number in the list after 3 by counting up from 3 in increments of 3 (these will be all the multiples of 3 in the list): 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 The next number not yet crossed out in the list after 3 is 5; cross out every 5th number in the list after 5 by counting up from 5 in increments of 5 (i.e. all the multiples of 5): 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 The next number not yet crossed out in the list after 5 is 7; the next step would be to cross out every 7th number in the list after 7, but they are all already crossed out at this point, as these numbers (14, 21, 28) are also multiples of smaller primes because 7 × 7 is greater than 30. The numbers not crossed out at this point in the list are all the prime numbers below 30: 2 3 5 7 11 13 17 19 23 29 == Algorithm and variants == === Pseudocode === The sieve of Eratosthenes can be expressed in pseudocode, as follows: algorithm Sieve of Eratosthenes is input: an integer n > 1. output: all prime numbers from 2 through n. let A be an array of Boolean values, indexed by integers 2 to n, initially all set to true. for i = 2, 3, 4, ..., not exceeding √n do if A[i] is true for j = i2, i2+i, i2+2i, i2+3i, ..., not exceeding n do set A[j] := false return all i such that A[i] is true. This algorithm produces all primes not greater than n. It includes a common optimization, which is to start enumerating the multiples of each prime i from i2. The time complexity of this algorithm is O(n log log n), provided the array update is an O(1) operation, as is usually the case. === Segmented sieve === As Sorenson notes, the problem with the sieve of Eratosthenes is not the number of operations it performs but rather its memory requirements. For large n, the range of primes may not fit in memory; worse, even for moderate n, its cache use is highly suboptimal. The algorithm walks through the entire array A, exhibiting almost no locality of reference. A solution to these problems is offered by segmented sieves, where only portions of the range are sieved at a time. These have been known since the 1970s, and work as follows: Divide the range 2 through n into segments of some size Δ ≥ √n. Find the primes in the first (i.e. the lowest) segment, using the regular sieve. For each of the following segments, in increasing order, with m being the segment's topmost value, find the primes in it as follows: Set up a Boolean array of size Δ. Mark as non-prime the positions in the array corresponding to the multiples of each prime p ≤ √m found so far, by enumerating its multiples in steps of p starting from the lowest multiple of p between m - Δ and m. The remaining non-marked positions in the array correspond to the primes in the segment. It is not necessary to mark any multiples of these primes, because all of these primes are larger than √m, as for k ≥ 1, one has ( k Δ + 1 ) 2 > ( k + 1 ) Δ {\displaystyle (k\Delta +1)^{2}>(k+1)\Delta } . If Δ is chosen to be √n, the space complexity of the algorithm is O(√n), while the time complexity is the same as that of the regular sieve. For ranges with upper limit n so large that the sieving primes below √n as required by the page segmented sieve of Eratosthenes cannot fit in memory, a slower but much more space-efficient sieve like the pseudosquares prime sieve, developed by Jonathan P. Sorenson, can be used instead. === Incremental sieve === An incremental formulation of the sieve generates primes indefinitely (i.e., without an upper bound) by interleaving the generation of primes with the generation of their multiples (so that primes can be found in gaps between the multiples), where the multiples of each prime p are generated directly by counting up from the square of the prime in increments of p (or 2p for odd primes). The generation must be initiated only when the prime's square is reached, to avoid adverse effects on efficiency. It can be expressed symbolically under the dataflow paradigm as primes = [2, 3, ...] \ [[p², p²+p, ...] for p in primes], using list comprehension notation with \ denoting set subtraction of arithmetic progressions of numbers. Primes can also be produced by iteratively sieving out the composites through divisibility testing by sequential primes, one prime at a time. It is not the sieve of Eratosthenes but is often confused with it, even though the sieve of Eratosthenes directly generates the composites instead of testing for them. Trial division has worse theoretical complexity than that of the sieve of Eratosthenes in generating ranges of primes. When testing each prime, the optimal trial division algorithm uses all prime numbers not exceeding its square root, whereas the sieve of Eratosthenes produces each composite from its prime factors only, and gets the primes "for free", between the composites. The widely known 1975 functional sieve code by David Turner is often presented as an example of the sieve of Eratosthenes but is actually a sub-optimal trial division sieve. == Algorithmic complexity == The sieve of Eratosthenes is a popular way to benchmark computer performance. The time complexity of calculating all primes below n in the random access machine model is O(n log log n) ope

    Read more →
  • Mixed raster content

    Mixed raster content

    Mixed raster content (MRC) is a method for compressing images that contain both binary-compressible text and continuous-tone components, using image segmentation methods to improve the level of compression and the quality of the rendered image. By separating the image into components with different compressibility characteristics, the most efficient and accurate compression algorithm for each component can be applied. MRC-compressed images are typically packaged into a hybrid file format such as DjVu and sometimes PDF. This allows for multiple images, and the instructions to properly render and reassemble them, to be stored within a single file. Some image scanners optionally support MRC when scanning to PDF. A typical manual states that without MRC, the image is generated in a single process, with text and graphics not distinguished. With MRC, separate processes are used for text, graphics, and other elements, producing clearer graphics and sharper text, at the price of slightly slower processing. MRC is recommended to optimise the scanning of documents with harder-to-read text or lower-quality graphics. MRC can also reduce the size of the scanned file, though higher compression using JBIG2 can sometimes lead to character substitution errors in scanned documents. == File format == A form of MRC is defined by international standard bodies as ISO/IEC 16485, or ITU recommendation T.44 (accessible free of charge). It defines a file format with bilevel masks and two data layers in each "stripe" of the image. The mask can be encoded in ITU T.4, JBIG1, or JBIG2, while the images can be JPEG, JBIG1, or run-length encoded color. The format is loosely based on JPEG, with a APP13 segment registered for this purpose. It is not known whether this file format is actually used, as formats like DjVu and PDF have their own ways of defining layers and masks.

    Read more →
  • Harold Borko

    Harold Borko

    Harold Borko (1922-2012) was an American psychologist and researcher working primarily in the field of information science. == Biography == Borko was born in 1922 in New York City, New York. After serving in the US Army from 1942 to 1946 he obtained a BA in Psychology from the University of California, Los Angeles in 1948 and both his MA and PhD from the University of Southern California in Psychology in 1952. He returned to the army as a psychologist until 1956 after which he began a career working in and teaching information science. He died in California in 2012. == Information Science Career == After leaving the military Borko began working at the RAND Corporation as a Systems Training Specialist in 1956 and moved to the Systems Development Corporation a year later working in the Language Processing and Retrieval department. Alongside this work he taught Psychology at USC from 1957-65 and then moved into teaching Library Science at UCLA from 1965. In 1967 Borko left his role at the Systems Development Corporation and continued as a full-time professor at UCLA until his retirement in 1993.. From 1961 to 1995 Borko authored and co-authored over 100 articles on new developments in the field as well as the historiography of information science. He served as an editor of the Journal of Educational Data Processing from 1963-1975 and as President of the American Society for Information Science in 1966 == Partial list of works == Borko, H. (1962, May). The construction of an empirically based mathematically derived classification system. In Proceedings of the May 1-3, 1962, spring joint computer conference (pp. 279-289). Borko, H., & Bernick, M. (1963). Automatic document classification. Journal of the ACM (JACM), 10(2), 151-162. Borko, H. (1964). The Storage and Retrieval of Educational Information. Journal of Teacher Education, 15(4), 449-452. Borko, H. (1964). Measuring the reliability of subject classification by men and machines. American Documentation, 15(4), 268-273. Borko, H. (1965). The conceptual foundations of information systems. Borko, H. (1968), Information science: What is it?†. Amer. Doc., 19: 3-5. https://doi.org/10.1002/asi.5090190103 Borko, H. (1970). Experiments in book indexing by computer. Information storage and retrieval, 6(1), 5-16. Borko, H. (1985). An introduction to computer-based library systems (Lucy A. Tedd). Education for Information, 3(1), 61.

    Read more →
  • Enterprise Objects Framework

    Enterprise Objects Framework

    The Enterprise Objects Framework, or simply EOF, was introduced by NeXT in 1994 as a pioneering object-relational mapping product for its NeXTSTEP and OpenStep development platforms. EOF abstracts the process of interacting with a relational database by mapping database rows to Java or Objective-C objects. This largely relieves developers from writing low-level SQL code. EOF enjoyed some niche success in the mid-1990s among financial institutions who were attracted to the rapid application development advantages of NeXT's object-oriented platform. Since Apple Inc's merger with NeXT in 1996, EOF has evolved into a fully integrated part of WebObjects, an application server also originally from NeXT. Many of the core concepts of EOF re-emerged as part of Core Data, which further abstracts the underlying data formats to allow it to be based on non-SQL stores. == History == In the early 1990s NeXT Computer recognized that connecting to databases was essential to most businesses and yet also potentially complex. Every data source has a different data-access language (or API), driving up the costs to learn and use each vendor's product. The NeXT engineers wanted to apply the advantages of object-oriented programming, by getting objects to "talk" to relational databases. As the two technologies are very different, the solution was to create an abstraction layer, insulating developers from writing the low-level procedural code (SQL) specific to each data source. The first attempt came in 1992 with the release of Database Kit (DBKit), which wrapped an object-oriented framework around any database. Unfortunately, NEXTSTEP at the time was not powerful enough and DBKit had serious design flaws. NeXT's second attempt came in 1994 with the Enterprise Objects Framework (EOF) version 1, a complete rewrite that was far more modular and OpenStep compatible. EOF 1.0 was the first product released by NeXT using the Foundation Kit and introduced autoreleased objects to the developer community. The development team at the time was only four people: Jack Greenfield, Rich Williamson, Linus Upson and Dan Willhite. EOF 2.0, released in late 1995, further refined the architecture, introducing the editing context. At that point, the development team consisted of Dan Willhite, Craig Federighi, Eric Noyau and Charly Kleissner. EOF achieved a modest level of popularity in the financial programming community in the mid-1990s, but it would come into its own with the emergence of the World Wide Web and the concept of web applications. It was clear that EOF could help companies plug their legacy databases into the Web without any rewriting of that data. With the addition of frameworks to do state management, load balancing and dynamic HTML generation, NeXT was able to launch the first object-oriented Web application server, WebObjects, in 1996, with EOF at its core. In 2000, Apple Inc. (which had merged with NeXT) officially dropped EOF as a standalone product, meaning that developers would be unable to use it to create desktop applications for the forthcoming Mac OS X. It would, however, continue to be an integral part of a major new release of WebObjects. WebObjects 5, released in 2001, was significant for the fact that its frameworks had been ported from their native Objective-C programming language to the Java language. Critics of this change argue that most of the power of EOF was a side effect of its Objective-C roots, and that EOF lost the beauty or simplicity it once had. Third-party tools, such as EOGenerator, help fill the deficiencies introduced by Java (mainly due to the loss of categories). The Objective-C code base was re-introduced with some modifications to desktop application developers as Core Data, part of Apple's Cocoa API, with the release of Mac OS X Tiger in April 2005. == How EOF works == Enterprise Objects provides tools and frameworks for object-relational mapping. The technology specializes in providing mechanisms to retrieve data from various data sources, such as relational databases via JDBC and JNDI directories, and mechanisms to commit data back to those data sources. These mechanisms are designed in a layered, abstract approach that allows developers to think about data retrieval and commitment at a higher level than a specific data source or data source vendor. Central to this mapping is a model file (an "EOModel") that you build with a visual tool — either EOModeler, or the EOModeler plug-in to Xcode. The mapping works as follows: Database tables are mapped to classes. Database columns are mapped to class attributes. Database rows are mapped to objects (or class instances). You can build data models based on existing data sources or you can build data models from scratch, which you then use to create data structures (tables, columns, joins) in a data source. The result is that database records can be transposed into Java objects. The advantage of using data models is that applications are isolated from the idiosyncrasies of the data sources they access. This separation of an application's business logic from database logic allows developers to change the database an application accesses without needing to change the application. EOF provides a level of database transparency not seen in other tools and allows the same model to be used to access different vendor databases and even allows relationships across different vendor databases without changing source code. Its power comes from exposing the underlying data sources as managed graphs of persistent objects. In simple terms, this means that it organizes the application's model layer into a set of defined in-memory data objects. It then tracks changes to these objects and can reverse those changes on demand, such as when a user performs an undo command. Then, when it is time to save changes to the application's data, it archives the objects to the underlying data sources. === Using Inheritance === In designing Enterprise Objects developers can leverage the object-oriented feature known as inheritance. A Customer object and an Employee object, for example, might both inherit certain characteristics from a more generic Person object, such as name, address, and phone number. While this kind of thinking is inherent in object-oriented design, relational databases have no explicit support for inheritance. However, using Enterprise Objects, you can build data models that reflect object hierarchies. That is, you can design database tables to support inheritance by also designing enterprise objects that map to multiple tables or particular views of a database table. == Enterprise Objects (EOs) == An Enterprise Object is analogous to what is often known in object-oriented programming as a business object — a class which models a physical or conceptual object in the business domain (e.g. a customer, an order, an item, etc.). What makes an EO different from other objects is that its instance data maps to a data store. Typically, an enterprise object contains key-value pairs that represent a row in a relational database. The key is basically the column name, and the value is what was in that row in the database. So it can be said that an EO's properties persist beyond the life of any particular running application. More precisely, an Enterprise Object is an instance of a class that implements the com.webobjects.eocontrol.EOEnterpriseObject interface. An Enterprise Object has a corresponding model (called an EOModel) that defines the mapping between the class's object model and the database schema. However, an enterprise object doesn't explicitly know about its model. This level of abstraction means that database vendors can be switched without it affecting the developer's code. This gives Enterprise Objects a high degree of reusability. == EOF and Core Data == Despite their common origins, the two technologies diverged, with each technology retaining a subset of the features of the original Objective-C code base, while adding some new features. === Features Supported Only by EOF === EOF supports custom SQL; shared editing contexts; nested editing contexts; and pre-fetching and batch faulting of relationships, all features of the original Objective-C implementation not supported by Core Data. Core Data also does not provide the equivalent of an EOModelGroup—the NSManagedObjectModel class provides methods for merging models from existing models, and for retrieving merged models from bundles. === Features Supported Only by Core Data === Core Data supports fetched properties; multiple configurations within a managed object model; local stores; and store aggregation (the data for a given entity may be spread across multiple stores); customization and localization of property names and validation warnings; and the use of predicates for property validation. These features of the original Objective-C implementation are not supported by the Java implementation.

    Read more →
  • System of record

    System of record

    A system of record (SOR) or source system of record (SSoR) is a data management term for an information storage system (commonly implemented on a computer system running a database management system) that is the authoritative data source for a given data element or piece of information, like for example a row (or record) in a table. In data vault it is referred to as the record source. == Background == The need to identify systems of record can become acute in organizations where management information systems have been built by taking output data from multiple source systems, re-processing this data, and then re-presenting the result for a new business use. In these cases, multiple information systems may disagree about the same piece of information. These disagreements may stem from semantic differences, differences in opinion, use of different sources, differences in the timing of the extract, transform, load processes that create the data they report against, or may simply be the result of bugs. == Use == The integrity and validity of any data set is open to question when there is no traceable connection to a good source, and listing a source system of record is a solution to this. Where the integrity of the data is vital, if there is an agreed system of record, the data element must either be linked to, or extracted directly from it. In other cases, the provenance and estimated data quality should be documented. The "system of record" approach is a good fit for environments where both: there is a single authority over all data consumers, and all consumers have similar needs == Trade-offs == In diverse environments, one instead needs to support the presence of multiple opinions. Consumers may accept different authorities or may differ on what constitutes an authoritative source—researchers may prefer carefully vetted data, while tactical military systems may require the most recent credible report.

    Read more →
  • ArcSoft ShowBiz

    ArcSoft ShowBiz

    ShowBiz is a video editor by ArcSoft for the Windows operating system. It can create VCD and DVDs and can also export to the formats AVI, MPEG, WMV, and MOV. ShowBiz also contains a DVD burning and menu building feature. As of 2003, it was one of the three most dominant bundled titles. == Reception == PC Magazine reviewer Jan Ozer states: "ArcSoft's ShowBiz has evolved into a competent editor that's generally more usable than Dazzle's MovieStar program, providing more configuration controls, better preview features, and a much greater range of fun effects." John Virata, senior editor of Digital Media Online, says in his three page review of ShowBiz DVD 2, "It is an easy editor to work with and has a logically laid out interface that takes you step by step through the video creation and DVD creation process"

    Read more →
  • CENDI

    CENDI

    CENDI (Commerce, Energy, NASA, Defense Information Managers Group) is an interagency group of senior Scientific and Technical Information (STI) managers from 14 United States federal agencies. CENDI managers cooperate by exchanging information and ideas, collaborating to address common issues, and undertaking joint initiatives. CENDI's accomplishments range from impacting federal information policy to educating a broad spectrum of stakeholders on all aspects of federal STI systems, including its value to research and the taxpayer, and to operational improvements in agency and interagency STI operations. == History == CENDI traces its roots to the Committee on Scientific and Technical Information (COSATI) of the Federal Council on Science and Technology. COSATI was established in the early 1960s to coordinate the management of the results from the U.S. government's increasing commitment to scientific research and technology development. The scientific and technical information (STI) managers of the government's major research and development (R&D) agencies worked within COSATI to standardize guidelines for cataloging and indexing technical reports. COSATI ceased formal operations in the early 1970s. To continue the cooperation begun under COSATI, managers of agency STI programs from Commerce (National Technical Information Service), Energy (Office of Scientific and Technical Information), NASA (HQ/STI Division), and Defense (Defense Technical Information Center) began meeting periodically to discuss common topics and stimulate more effective cooperation. In 1985, a Memorandum of Understanding was signed by the four charter agencies and CENDI was established. From this small core of STI managers, CENDI has grown to its current membership, which represents the major science agencies, the national libraries, and agencies involved in the dissemination and long-term management of scientific and technical information. The vision of CENDI is to facilitate cooperative enterprise where capabilities are shared and challenges are faced together so that the sum of the accomplishments is greater than each individual agency can achieve on its own amongst federal STI agencies. The abbreviation CENDI refers to the "Commerce, Energy, NASA, Defense Information Managers Group". == Membership == New members from other federal R&D information organizations may be admitted by unanimous agreement of the members. However, it is the intent of the group that membership in CENDI should remain small and focus on organizations with STI or supporting responsibilities. Each agency provides funding to CENDI. == Members == The members of CENDI are: Defense Technical Information Center (United States Department of Defense) Office of Research and Development and Office of Environmental Information (United States Environmental Protection Agency) Government Printing Office Library of Congress NASA Scientific and Technical Information Program National Agricultural Library (United States Department of Agriculture) National Archives and Records Administration National Library of Education (United States Department of Education) National Library of Medicine (United States Department of Health and Human Services) National Science Foundation National Technical Information Service (United States Department of Commerce) National Transportation Library (United States Department of Transportation) Office of Scientific and Technical Information (United States Department of Energy) USGS/Biological Resources Discipline (United States Department of the Interior) == Mission and operation == CENDI's mission is to help improve the productivity of federal science- and technology-based programs through effective scientific, technical, and related information support systems. In fulfilling its mission, CENDI agencies play an important role in addressing science- and technology-based national priorities and strengthening U.S. competitiveness. === Goals === STI Coordination and Leadership: Provide coordination and leadership for information exchange on important STI policy issues. Improvement of STI Systems: Promote the development of improved STI systems through the productive interrelationship of content and technology. STI Understanding: Promote better understanding of STI and STI management. === Principals and Alternates === CENDI is made up of senior federal STI managers and each organization appoints a Principal representative. This person is the point of contact for that organization within CENDI. Each Principal has an Alternate. The Principals and Alternates comprise the main group that meets on a regular basis, usually every other month. === Secretariat === A Tennessee-based information management company, -- Information International Associates, Inc., currently serves as the CENDI Secretariat. The Secretariat provides day-to-day operations to CENDI. The Secretariat prepares the necessary materials for the Principals' meetings, provides support for the working group and task group meetings, assists in developing papers, and maintains the CENDI files and outreach tools. === Task Groups and Working Groups === The chair(s) of a working group is appointed by the Principals and has the overall responsibility for the group's activities. The Secretariat provides support at the request of the Working Group chair(s). The Working Groups and Task Groups that are currently operating are: Copyright and Intellectual Property Working Group Distribution Markings Task Group Digital Preservation Task Group Digitization Specifications Task Group Image Metadata Task Group Science.gov (see below) STI Policy Working Group Terminology Resources Task Group === Science.gov and Worldwidescience.org === In 2001, in response to the April 2001 workshop on "Strengthening the Public Information Infrastructure for Science", and taking into consideration a request from Firstgov (now USA.gov) to develop specialized topical portals, CENDI formed an alliance to develop an interagency website for access to STI. This website, called Science.gov, is a one-stop source of STI, including both selected, authoritative government websites and deep Web databases of technical reports, journal articles, conference proceedings, and other published materials. Through the volunteer efforts of members and involving over 100 staff, content and architecture is developed for the site. The Science.gov website is hosted by the Department of Energy (DOE) Office of Scientific and Technical Information (OSTI). The site was formally launched in December 2002. As a result of the success of Science.gov, under DOE leadership and in cooperation with the International Council of Scientific and Technical Information, a worldwide coordination across national portals called WorldWideScience was launched in 2008. === Work with non-member organizations === CENDI works with several cooperating non-member organizations on a regular basis. These agencies are in academia, federal government, legal and policy analysis, international, non-governmental, and private organizations.

    Read more →
  • NoSQL

    NoSQL

    NoSQL (originally meaning "not only SQL" or "non-relational") refers to a type of database design that stores and retrieves data differently from the traditional table-based structure of relational databases. Unlike relational databases, which organize data into rows and columns like a spreadsheet, NoSQL databases use a single data structure—such as key–value pairs, wide columns, graphs, or documents—to hold information. Since this non-relational design does not require a fixed schema, it scales easily to manage large, often unstructured datasets. NoSQL systems are sometimes called "Not only SQL" because they can support SQL-like query languages or work alongside SQL databases in polyglot-persistent setups, where multiple database types are combined. Non-relational databases date back to the late 1960s, but the term "NoSQL" emerged in the early 2000s, spurred by the needs of Web 2.0 companies like social media platforms. NoSQL databases are popular in big data and real-time web applications due to their simple design, ability to scale across clusters of machines (called horizontal scaling), and precise control over data availability. These structures can speed up certain tasks and are often considered more adaptable than fixed database tables. However, many NoSQL systems prioritize speed and availability over strict consistency (per the CAP theorem), using eventual consistency—where updates reach all nodes eventually, typically within milliseconds, but may cause brief delays in accessing the latest data, known as stale reads. While most lack full ACID transaction support, some, like MongoDB, include it as a key feature. == Barriers to adoption == Barriers to wider NoSQL adoption include their use of low-level query languages instead of SQL, inability to perform ad hoc joins across tables, lack of standardized interfaces, and significant investments already made in relational databases. Some NoSQL systems risk losing data through lost writes or other forms, though features like write-ahead logging—a method to record changes before they’re applied—can help prevent this. For distributed transaction processing across multiple databases, keeping data consistent is a challenge for both NoSQL and relational systems, as relational databases cannot enforce rules linking separate databases, and few systems support both ACID transactions and X/Open XA standards for managing distributed updates. Limitations within the interface environment are overcome using semantic virtualization protocols, such that NoSQL services are accessible to most operating systems. == History == The term NoSQL was used by Carlo Strozzi in 1998 to name his lightweight Strozzi NoSQL open-source relational database that did not expose the standard Structured Query Language (SQL) interface, but was still relational. His NoSQL RDBMS is distinct from the around-2009 general concept of NoSQL databases. Strozzi suggests that, because the current NoSQL movement "departs from the relational model altogether, it should therefore have been called more appropriately 'NoREL'", referring to "not relational". Johan Oskarsson, then a developer at Last.fm, reintroduced the term NoSQL in early 2009 when he organized an event to discuss "open-source distributed, non-relational databases". The name attempted to label the emergence of an increasing number of non-relational, distributed data stores, including open source clones of Google's Bigtable/MapReduce and Amazon's DynamoDB. == Types and examples == There are various ways to classify NoSQL databases, with different categories and subcategories, some of which overlap. What follows is a non-exhaustive classification by data model, with examples: === Key–value store === Key–value (KV) stores use the associative array (also called a map or dictionary) as their fundamental data model. In this model, data is represented as a collection of key–value pairs, such that each possible key appears at most once in the collection. The key–value model is one of the simplest non-trivial data models, and richer data models are often implemented as an extension of it. The key–value model can be extended to a discretely ordered model that maintains keys in lexicographic order. This extension is computationally powerful, in that it can efficiently retrieve selective key ranges. Key–value stores can use consistency models ranging from eventual consistency to serializability. Some databases support ordering of keys. There are various hardware implementations, and some users store data in memory (RAM), while others on solid-state drives (SSD) or rotating disks (aka hard disk drive (HDD)). === Document store === The central concept of a document store is that of a "document". While the details of this definition differ among document-oriented databases, they all assume that documents encapsulate and encode data (or information) in some standard formats or encodings. Encodings in use include XML, YAML, and JSON and binary forms like BSON. Documents are addressed in the database via a unique key that represents that document. Another defining characteristic of a document-oriented database is an API or query language to retrieve documents based on their contents. Different implementations offer different ways of organizing and/or grouping documents: Collections Tags Non-visible metadata Directory hierarchies Compared to relational databases, collections could be considered analogous to tables and documents analogous to records. But they are different – every record in a table has the same sequence of fields, while documents in a collection may have fields that are completely different. === Graph === Graph databases are designed for data whose relations are well represented as a graph consisting of elements connected by a finite number of relations. Examples of data include social relations, public transport links, road maps, network topologies, etc. Graph databases and their query language == Performance == The performance of NoSQL databases is usually evaluated using the metric of throughput, which is measured as operations per second. Performance evaluation must pay attention to the right benchmarks such as production configurations, parameters of the databases, anticipated data volume, and concurrent user workloads. Ben Scofield rated different categories of NoSQL databases as follows: Performance and scalability comparisons are most commonly done using the YCSB benchmark. == Handling relational data == Since most NoSQL databases lack ability for joins in queries, the database schema generally needs to be designed differently. There are three main techniques for handling relational data in a NoSQL database. (See table join and ACID support for NoSQL databases that support joins.) === Multiple queries === Instead of retrieving all the data with one query, it is common to do several queries to get the desired data. NoSQL queries are often faster than traditional SQL queries, so the cost of additional queries may be acceptable. If an excessive number of queries would be necessary, one of the other two approaches is more appropriate. === Caching, replication and non-normalized data === Instead of only storing foreign keys, it is common to store actual foreign values along with the model's data. For example, each blog comment might include the username in addition to a user id, thus providing easy access to the username without requiring another lookup. When a username changes, however, this will now need to be changed in many places in the database. Thus this approach works better when reads are much more common than writes. === Nesting data === With document databases like MongoDB it is common to put more data in a smaller number of collections. For example, in a blogging application, one might choose to store comments within the blog post document, so that with a single retrieval one gets all the comments. Thus in this approach a single document contains all the data needed for a specific task. == ACID and join support == A database is marked as supporting ACID properties (atomicity, consistency, isolation, durability) or join operations if the documentation for the database makes that claim. However, this doesn't necessarily mean that the capability is fully supported in a manner similar to most SQL databases. == Query optimization and indexing in NoSQL databases == Different NoSQL databases, such as DynamoDB, MongoDB, Cassandra, Couchbase, HBase, and Redis, exhibit varying behaviors when querying non-indexed fields. Many perform full-table or collection scans for such queries, applying filtering operations after retrieving data. However, modern NoSQL databases often incorporate advanced features to optimize query performance. For example, MongoDB supports compound indexes and query-optimization strategies, Cassandra offers secondary indexes and materialized views, and Redis employs custom indexing mechanisms tailored to specific use cases. Systems like El

    Read more →