AI Humanizer

AI Humanizer — hands-on reviews, top picks, pricing, pros and cons and a practical how-to guide on Aizhi.

  • Hint (app)

    Hint (app)

    Hint (hint.app) is an American software platform that provides astrological content, personality assessments, and relationship compatibility tools. The application was launched in 2018 and is based in Claymont, Delaware. The platform has been described in media coverage as part of a broader trend of astrology-based and self-reflection applications, particularly among younger users. As of 2026, the company reports that it has reached more than 25 million users worldwide. == History == Hint was founded in 2018 and is headquartered in Claymont, Delaware. The platform was developed to address a growing demand among Millennials and Gen Z for structured self-reflection tools that deviate from traditional religious or clinical psychological frameworks. The app has become a prominent figure in the "emotional technology" sector, reaching over 25 million global users by 2026. The platform is frequently cited by sociologists and media outlets as a primary driver of the Open-source intelligence trend, where individuals use digital tools to vet and analyze personal relationships in the dating economy. Media coverage has described the platform as part of a broader trend in which digital tools incorporate astrology and symbolic frameworks into wellness and relationship advice. == Reception == Coverage of Hint has appeared alongside reporting on changing attitudes toward dating and relationships, particularly among younger adults. Surveys reported by media outlets have described shifts in dating behavior, including reduced interest in casual relationships and increased reliance on digital tools for emotional reflection and compatibility assessment. Additional reporting has linked the use of astrology apps to broader trends in emotional fatigue and changing relationship expectations. Lifestyle and culture publications have described Hint, as an example of applications that integrate astrology into digital self-reflection and relationship analysis.

    Read more →
  • Nearest centroid classifier

    Nearest centroid classifier

    In machine learning, a nearest centroid classifier or nearest prototype classifier is a classification model that assigns to observations the label of the class of training samples whose mean (centroid) is closest to the observation. When applied to text classification using word vectors containing tfidf weights to represent documents, the nearest centroid classifier is known as the Rocchio classifier because of its similarity to the Rocchio algorithm for relevance feedback. An extended version of the nearest centroid classifier has found applications in the medical domain, specifically classification of tumors. == Algorithm == === Training === Given labeled training samples { ( x → 1 , y 1 ) , … , ( x → n , y n ) } {\displaystyle \textstyle \{({\vec {x}}_{1},y_{1}),\dots ,({\vec {x}}_{n},y_{n})\}} with class labels y i ∈ Y {\displaystyle y_{i}\in \mathbf {Y} } , compute the per-class centroids μ → ℓ = 1 | C ℓ | ∑ i ∈ C ℓ x → i {\displaystyle \textstyle {\vec {\mu }}_{\ell }={\frac {1}{|C_{\ell }|}}{\underset {i\in C_{\ell }}{\sum }}{\vec {x}}_{i}} where C ℓ {\displaystyle C_{\ell }} is the set of indices of samples belonging to class ℓ ∈ Y {\displaystyle \ell \in \mathbf {Y} } . === Prediction === The class assigned to an observation x → {\displaystyle {\vec {x}}} is y ^ = arg ⁡ min ℓ ∈ Y ‖ μ → ℓ − x → ‖ {\displaystyle {\hat {y}}={\arg \min }_{\ell \in \mathbf {Y} }\|{\vec {\mu }}_{\ell }-{\vec {x}}\|} .

    Read more →
  • Vapnik–Chervonenkis theory

    Vapnik–Chervonenkis theory

    Vapnik–Chervonenkis theory (also known as VC theory) was developed during 1960–1990 by Vladimir Vapnik and Alexey Chervonenkis. The theory is a form of computational learning theory, which attempts to explain the learning process from a statistical point of view. == Introduction == VC theory covers at least four parts (as explained in The Nature of Statistical Learning Theory): Theory of consistency of learning processes What are (necessary and sufficient) conditions for consistency of a learning process based on the empirical risk minimization principle? Nonasymptotic theory of the rate of convergence of learning processes How fast is the rate of convergence of the learning process? Theory of controlling the generalization ability of learning processes How can one control the rate of convergence (the generalization ability) of the learning process? Theory of constructing learning machines How can one construct algorithms that can control the generalization ability? VC Theory is a major subbranch of statistical learning theory. One of its main applications in statistical learning theory is to provide generalization conditions for learning algorithms. From this point of view, VC theory is related to stability, which is an alternative approach for characterizing generalization. In addition, VC theory and VC dimension are instrumental in the theory of empirical processes, in the case of processes indexed by VC classes. Arguably these are the most important applications of the VC theory, and are employed in proving generalization. Several techniques will be introduced that are widely used in the empirical process and VC theory. The discussion is mainly based on the book Weak Convergence and Empirical Processes: With Applications to Statistics. == Overview of VC theory in empirical processes == === Background on empirical processes === Let ( X , A ) {\displaystyle ({\mathcal {X}},{\mathcal {A}})} be a measurable space. For any measure Q {\displaystyle Q} on ( X , A ) {\displaystyle ({\mathcal {X}},{\mathcal {A}})} , and any measurable functions f : X → R {\displaystyle f:{\mathcal {X}}\to \mathbf {R} } , define Q f = ∫ f d Q {\displaystyle Qf=\int fdQ} Measurability issues will be ignored here, for more technical detail see. Let F {\displaystyle {\mathcal {F}}} be a class of measurable functions f : X → R {\displaystyle f:{\mathcal {X}}\to \mathbf {R} } and define: ‖ Q ‖ F = sup { | Q f | : f ∈ F } . {\displaystyle \|Q\|_{\mathcal {F}}=\sup\{\vert Qf\vert \ :\ f\in {\mathcal {F}}\}.} Let X 1 , … , X n {\displaystyle X_{1},\ldots ,X_{n}} be independent, identically distributed random elements of ( X , A ) {\displaystyle ({\mathcal {X}},{\mathcal {A}})} . Then define the empirical measure P n = n − 1 ∑ i = 1 n δ X i , {\displaystyle \mathbb {P} _{n}=n^{-1}\sum _{i=1}^{n}\delta _{X_{i}},} where δ here stands for the Dirac measure. The empirical measure induces a map F → R {\displaystyle {\mathcal {F}}\to \mathbf {R} } given by: f ↦ P n f = 1 n ( f ( X 1 ) + . . . + f ( X n ) ) {\displaystyle f\mapsto \mathbb {P} _{n}f={\frac {1}{n}}(f(X_{1})+...+f(X_{n}))} Now suppose P is the underlying true distribution of the data, which is unknown. Empirical Processes theory aims at identifying classes F {\displaystyle {\mathcal {F}}} for which statements such as the following hold: uniform law of large numbers: ‖ P n − P ‖ F → n 0 , {\displaystyle \|\mathbb {P} _{n}-P\|_{\mathcal {F}}{\underset {n}{\to }}0,} That is, as n → ∞ {\displaystyle n\to \infty } , | 1 n ( f ( X 1 ) + . . . + f ( X n ) ) − ∫ f d P | → 0 {\displaystyle \left|{\frac {1}{n}}(f(X_{1})+...+f(X_{n}))-\int fdP\right|\to 0} uniformly for all f ∈ F {\displaystyle f\in {\mathcal {F}}} . uniform central limit theorem: G n = n ( P n − P ) ⇝ G , in ℓ ∞ ( F ) {\displaystyle \mathbb {G} _{n}={\sqrt {n}}(\mathbb {P} _{n}-P)\rightsquigarrow \mathbb {G} ,\quad {\text{in }}\ell ^{\infty }({\mathcal {F}})} In the former case F {\displaystyle {\mathcal {F}}} is called Glivenko–Cantelli class, and in the latter case (under the assumption ∀ x , sup f ∈ F | f ( x ) − P f | < ∞ {\displaystyle \forall x,\sup \nolimits _{f\in {\mathcal {F}}}\vert f(x)-Pf\vert <\infty } ) the class F {\displaystyle {\mathcal {F}}} is called Donsker or P-Donsker. A Donsker class is Glivenko–Cantelli in probability by an application of Slutsky's theorem. These statements are true for a single f {\displaystyle f} , by standard LLN, CLT arguments under regularity conditions, and the difficulty in the Empirical Processes comes in because joint statements are being made for all f ∈ F {\displaystyle f\in {\mathcal {F}}} . Intuitively then, the set F {\displaystyle {\mathcal {F}}} cannot be too large, and as it turns out that the geometry of F {\displaystyle {\mathcal {F}}} plays a very important role. One way of measuring how big the function set F {\displaystyle {\mathcal {F}}} is to use the so-called covering numbers. The covering number N ( ε , F , ‖ ⋅ ‖ ) {\displaystyle N(\varepsilon ,{\mathcal {F}},\|\cdot \|)} is the minimal number of balls { g : ‖ g − f ‖ < ε } {\displaystyle \{g:\|g-f\|<\varepsilon \}} needed to cover the set F {\displaystyle {\mathcal {F}}} (here it is obviously assumed that there is an underlying norm on F {\displaystyle {\mathcal {F}}} ). The entropy is the logarithm of the covering number. Two sufficient conditions are provided below, under which it can be proved that the set F {\displaystyle {\mathcal {F}}} is Glivenko–Cantelli or Donsker. A class F {\displaystyle {\mathcal {F}}} is P-Glivenko–Cantelli if it is P-measurable with envelope F such that P ∗ F < ∞ {\displaystyle P^{\ast }F<\infty } and satisfies: ∀ ε > 0 sup Q N ( ε ‖ F ‖ Q , F , L 1 ( Q ) ) < ∞ . {\displaystyle \forall \varepsilon >0\quad \sup \nolimits _{Q}N(\varepsilon \|F\|_{Q},{\mathcal {F}},L_{1}(Q))<\infty .} The next condition is a version of Dudley's theorem. If F {\displaystyle {\mathcal {F}}} is a class of functions such that ∫ 0 ∞ sup Q log ⁡ N ( ε ‖ F ‖ Q , 2 , F , L 2 ( Q ) ) d ε < ∞ {\displaystyle \int _{0}^{\infty }\sup \nolimits _{Q}{\sqrt {\log N\left(\varepsilon \|F\|_{Q,2},{\mathcal {F}},L_{2}(Q)\right)}}d\varepsilon <\infty } then F {\displaystyle {\mathcal {F}}} is P-Donsker for every probability measure P such that P ∗ F 2 < ∞ {\displaystyle P^{\ast }F^{2}<\infty } . In the last integral, the notation means ‖ f ‖ Q , 2 = ( ∫ | f | 2 d Q ) 1 2 {\displaystyle \|f\|_{Q,2}=\left(\int |f|^{2}dQ\right)^{\frac {1}{2}}} . === Symmetrization === The majority of the arguments about how to bound the empirical process rely on symmetrization, maximal and concentration inequalities, and chaining. Symmetrization is usually the first step of the proofs, and since it is used in many machine learning proofs on bounding empirical loss functions (including the proof of the VC inequality which is discussed in the next section). It is presented here: Consider the empirical process: f ↦ ( P n − P ) f = 1 n ∑ i = 1 n ( f ( X i ) − P f ) {\displaystyle f\mapsto (\mathbb {P} _{n}-P)f={\dfrac {1}{n}}\sum _{i=1}^{n}(f(X_{i})-Pf)} Turns out that there is a connection between the empirical and the following symmetrized process: f ↦ P n 0 f = 1 n ∑ i = 1 n ε i f ( X i ) {\displaystyle f\mapsto \mathbb {P} _{n}^{0}f={\dfrac {1}{n}}\sum _{i=1}^{n}\varepsilon _{i}f(X_{i})} The symmetrized process is a Rademacher process, conditionally on the data X i {\displaystyle X_{i}} . Therefore, it is a sub-Gaussian process by Hoeffding's inequality. Lemma (Symmetrization). For every nondecreasing, convex Φ: R → R and class of measurable functions F {\displaystyle {\mathcal {F}}} , E Φ ( ‖ P n − P ‖ F ) ≤ E Φ ( 2 ‖ P n 0 ‖ F ) {\displaystyle \mathbb {E} \Phi (\|\mathbb {P} _{n}-P\|_{\mathcal {F}})\leq \mathbb {E} \Phi \left(2\left\|\mathbb {P} _{n}^{0}\right\|_{\mathcal {F}}\right)} The proof of the Symmetrization lemma relies on introducing independent copies of the original variables X i {\displaystyle X_{i}} (sometimes referred to as a ghost sample) and replacing the inner expectation of the LHS by these copies. After an application of Jensen's inequality different signs could be introduced (hence the name symmetrization) without changing the expectation. The proof can be found below because of its instructive nature. The same proof method can be used to prove the Glivenko–Cantelli theorem. A typical way of proving empirical CLTs, first uses symmetrization to pass the empirical process to P n 0 {\displaystyle \mathbb {P} _{n}^{0}} and then argue conditionally on the data, using the fact that Rademacher processes are simple processes with nice properties. === VC Connection === It turns out that there is a fascinating connection between certain combinatorial properties of the set F {\displaystyle {\mathcal {F}}} and the entropy numbers. Uniform covering numbers can be controlled by the notion of Vapnik–Chervonenkis classes of sets – or shortly VC sets. Consider a collection C {\displaystyle {\mathcal {C}}} of subsets of the sample space X {\displaystyle

    Read more →
  • Semidefinite embedding

    Semidefinite embedding

    Maximum Variance Unfolding (MVU), also known as Semidefinite Embedding (SDE), is an algorithm in computer science that uses semidefinite programming to perform non-linear dimensionality reduction of high-dimensional vectorial input data. It is motivated by the observation that kernel Principal Component Analysis (kPCA) does not reduce the data dimensionality, as it leverages the Kernel trick to non-linearly map the original data into an inner-product space. == Algorithm == MVU creates a mapping from the high dimensional input vectors to some low dimensional Euclidean vector space in the following steps: A neighbourhood graph is created. Each input is connected with its k-nearest input vectors (according to Euclidean distance metric) and all k-nearest neighbors are connected with each other. If the data is sampled well enough, the resulting graph is a discrete approximation of the underlying manifold. The neighbourhood graph is "unfolded" with the help of semidefinite programming. Instead of learning the output vectors directly, the semidefinite programming aims to find an inner product matrix that maximizes the pairwise distances between any two inputs that are not connected in the neighbourhood graph while preserving the nearest neighbors distances. The low-dimensional embedding is finally obtained by application of multidimensional scaling on the learned inner product matrix. The steps of applying semidefinite programming followed by a linear dimensionality reduction step to recover a low-dimensional embedding into a Euclidean space were first proposed by Linial, London, and Rabinovich. == Optimization formulation == Let X {\displaystyle X\,\!} be the original input and Y {\displaystyle Y\,\!} be the embedding. If i , j {\displaystyle i,j\,\!} are two neighbors, then the local isometry constraint that needs to be satisfied is: | X i − X j | 2 = | Y i − Y j | 2 {\displaystyle |X_{i}-X_{j}|^{2}=|Y_{i}-Y_{j}|^{2}\,\!} Let G , K {\displaystyle G,K\,\!} be the Gram matrices of X {\displaystyle X\,\!} and Y {\displaystyle Y\,\!} (i.e.: G i j = X i ⋅ X j , K i j = Y i ⋅ Y j {\displaystyle G_{ij}=X_{i}\cdot X_{j},K_{ij}=Y_{i}\cdot Y_{j}\,\!} ). We can express the above constraint for every neighbor points i , j {\displaystyle i,j\,\!} in term of G , K {\displaystyle G,K\,\!} : G i i + G j j − G i j − G j i = K i i + K j j − K i j − K j i {\displaystyle G_{ii}+G_{jj}-G_{ij}-G_{ji}=K_{ii}+K_{jj}-K_{ij}-K_{ji}\,\!} In addition, we also want to constrain the embedding Y {\displaystyle Y\,\!} to center at the origin: 0 = | ∑ i Y i | 2 ⇔ ( ∑ i Y i ) ⋅ ( ∑ i Y i ) ⇔ ∑ i , j Y i ⋅ Y j ⇔ ∑ i , j K i j {\displaystyle 0=|\sum _{i}Y_{i}|^{2}\Leftrightarrow (\sum _{i}Y_{i})\cdot (\sum _{i}Y_{i})\Leftrightarrow \sum _{i,j}Y_{i}\cdot Y_{j}\Leftrightarrow \sum _{i,j}K_{ij}} As described above, except the distances of neighbor points are preserved, the algorithm aims to maximize the pairwise distance of every pair of points. The objective function to be maximized is: T ( Y ) = 1 2 N ∑ i , j | Y i − Y j | 2 {\displaystyle T(Y)={\dfrac {1}{2N}}\sum _{i,j}|Y_{i}-Y_{j}|^{2}} Intuitively, maximizing the function above is equivalent to pulling the points as far away from each other as possible and therefore "unfold" the manifold. The local isometry constraint Let τ = m a x { η i j | Y i − Y j | 2 } {\displaystyle \tau =max\{\eta _{ij}|Y_{i}-Y_{j}|^{2}\}\,\!} where η i j := { 1 if i is a neighbour of j 0 otherwise . {\displaystyle \eta _{ij}:={\begin{cases}1&{\mbox{if}}\ i{\mbox{ is a neighbour of }}j\\0&{\mbox{otherwise}}.\end{cases}}} prevents the objective function from diverging (going to infinity). Since the graph has N points, the distance between any two points | Y i − Y j | 2 ≤ N τ {\displaystyle |Y_{i}-Y_{j}|^{2}\leq N\tau \,\!} . We can then bound the objective function as follows: T ( Y ) = 1 2 N ∑ i , j | Y i − Y j | 2 ≤ 1 2 N ∑ i , j ( N τ ) 2 = N 3 τ 2 2 {\displaystyle T(Y)={\dfrac {1}{2N}}\sum _{i,j}|Y_{i}-Y_{j}|^{2}\leq {\dfrac {1}{2N}}\sum _{i,j}(N\tau )^{2}={\dfrac {N^{3}\tau ^{2}}{2}}\,\!} The objective function can be rewritten purely in the form of the Gram matrix: T ( Y ) = 1 2 N ∑ i , j | Y i − Y j | 2 = 1 2 N ∑ i , j ( Y i 2 + Y j 2 − Y i ⋅ Y j − Y j ⋅ Y i ) = 1 2 N ( ∑ i , j Y i 2 + ∑ i , j Y j 2 − ∑ i , j Y i ⋅ Y j − ∑ i , j Y j ⋅ Y i ) = 1 2 N ( ∑ i , j Y i 2 + ∑ i , j Y j 2 − 0 − 0 ) = 1 N ( ∑ i Y i 2 ) = 1 N ( T r ( K ) ) {\displaystyle {\begin{aligned}T(Y)&{}={\dfrac {1}{2N}}\sum _{i,j}|Y_{i}-Y_{j}|^{2}\\&{}={\dfrac {1}{2N}}\sum _{i,j}(Y_{i}^{2}+Y_{j}^{2}-Y_{i}\cdot Y_{j}-Y_{j}\cdot Y_{i})\\&{}={\dfrac {1}{2N}}(\sum _{i,j}Y_{i}^{2}+\sum _{i,j}Y_{j}^{2}-\sum _{i,j}Y_{i}\cdot Y_{j}-\sum _{i,j}Y_{j}\cdot Y_{i})\\&{}={\dfrac {1}{2N}}(\sum _{i,j}Y_{i}^{2}+\sum _{i,j}Y_{j}^{2}-0-0)\\&{}={\dfrac {1}{N}}(\sum _{i}Y_{i}^{2})={\dfrac {1}{N}}(Tr(K))\\\end{aligned}}\,\!} Finally, the optimization can be formulated as: Maximize T r ( K ) subject to K ⪰ 0 , ∑ i j K i j = 0 and G i i + G j j − G i j − G j i = K i i + K j j − K i j − K j i , ∀ i , j where η i j = 1 , {\displaystyle {\begin{aligned}&{\text{Maximize}}&&Tr(\mathbf {K} )\\&{\text{subject to}}&&\mathbf {K} \succeq 0,\sum _{ij}\mathbf {K} _{ij}=0\\&{\text{and}}&&G_{ii}+G_{jj}-G_{ij}-G_{ji}=K_{ii}+K_{jj}-K_{ij}-K_{ji},\forall i,j{\mbox{ where }}\eta _{ij}=1,\end{aligned}}} After the Gram matrix K {\displaystyle K\,\!} is learned by semidefinite programming, the output Y {\displaystyle Y\,\!} can be obtained via Cholesky decomposition. In particular, the Gram matrix can be written as K i j = ∑ α = 1 N ( λ α V α i V α j ) {\displaystyle K_{ij}=\sum _{\alpha =1}^{N}(\lambda _{\alpha }V_{\alpha i}V_{\alpha j})\,\!} where V α i {\displaystyle V_{\alpha i}\,\!} is the i-th element of eigenvector V α {\displaystyle V_{\alpha }\,\!} of the eigenvalue λ α {\displaystyle \lambda _{\alpha }\,\!} . It follows that the α {\displaystyle \alpha \,\!} -th element of the output Y i {\displaystyle Y_{i}\,\!} is λ α V α i {\displaystyle {\sqrt {\lambda _{\alpha }}}V_{\alpha i}\,\!} .

    Read more →
  • RCUDA

    RCUDA

    rCUDA, which stands for Remote CUDA, is a type of middleware software framework for remote GPU virtualization. Fully compatible with the CUDA application programming interface (API), it allows the allocation of one or more CUDA-enabled GPUs to a single application. Each GPU can be part of a cluster or running inside of a virtual machine. The approach is aimed at improving performance in GPU clusters that are lacking full utilization. GPU virtualization reduces the number of GPUs needed in a cluster, and in turn, leads to a lower cost configuration – less energy, acquisition, and maintenance. The recommended distributed acceleration architecture is a high performance computing cluster with GPUs attached to only a few of the cluster nodes. When a node without a local GPU executes an application needing GPU resources, remote execution of the kernel is supported by data and code transfers between local system memory and remote GPU memory. rCUDA is designed to accommodate this client-server architecture. On one end, clients employ a library of wrappers to the high-level CUDA Runtime API, and on the other end, there is a network listening service that receives requests on a TCP port. Several nodes running different GPU-accelerated applications can concurrently make use of the whole set of accelerators installed in the cluster. The client forwards the request to one of the servers, which accesses the GPU installed in that computer and executes the request in it. Time-multiplexing the GPU, or in other words sharing it, is accomplished by spawning different server processes for each remote GPU execution request. == rCUDA v20.07 == The rCUDA middleware enables the concurrent usage of CUDA-compatible devices remotely. rCUDA employs either the InfiniBand network or the socket API for the communication between clients and servers. rCUDA can be useful in three different environments: Clusters. To reduce the number of GPUs installed in High Performance Clusters. This leads to energy savings, as well as other related savings like acquisition costs, maintenance, space, cooling, etc. Academia. In commodity networks, to offer access to a few high performance GPUs concurrently to many students. Virtual Machines. To enable the access to the CUDA facilities on the physical machine. The current version of rCUDA (v20.07) supports CUDA version 9.0, excluding graphics interoperability. rCUDA v20.07 targets the Linux OS (for 64-bit architectures) on both client and server sides. CUDA applications do not need any change in their source code in order to be executed with rCUDA.

    Read more →
  • Prescription monitoring program

    Prescription monitoring program

    In the United States, prescription monitoring programs (PMPs) or prescription drug monitoring programs (PDMPs) are state-run programs which collect and distribute data about the prescription and dispensation of federally controlled substances and, depending on state requirements, other potentially abusable prescription drugs. PMPs are meant to help prevent adverse drug-related events such as opioid overdoses, drug diversion, and substance abuse by decreasing the amount and/or frequency of opioid prescribing, and by identifying those patients who are obtaining prescriptions from multiple providers (i.e., "doctor shopping") or those physicians overprescribing opioids. Most US health care workers support the idea of PMPs, which intend to assist physicians, physician assistants, nurse practitioners, dentists and other prescribers, the pharmacists, chemists and support staff of dispensing establishments. The database, whose use is required by State law, typically requires prescribers and pharmacies dispensing controlled substances to register with their respective state PMPs and (for pharmacies and providers who dispense from their offices) to report the dispensation of such prescriptions to an electronic online database. The majority of PMPs are authorized to notify law enforcement agencies or licensing boards or physicians when a prescriber, or patients receiving prescriptions, exceed thresholds established by the state or prescription recipient exceeds thresholds established by the State. All states have implemented PDMPs, although evidence for the effectiveness of these programs is mixed. While prescription of opioids has decreased with PMP use, overdose deaths in many states have actually increased, with those states sharing data with neighboring jurisdictions or requiring reporting of more drugs experiencing highest increases in deaths. This may be because those declined opioid prescriptions turn to street drugs, whose potency and contaminants carry greater overdose risk. == History == Prescription drug monitoring programs, or PDMPs, are an example of one initiative proposed to alleviate effects of the opioid crisis. The programs are designed to restrict prescription drug abuse by limiting a patient's ability to obtain similar prescriptions from multiple providers (i.e. “doctor shopping”) and reducing diversion of controlled substances. This is meant to reduce risk of fatal overdose caused by high doses of opioids or interactions between opioids and benzodiazepenes, and to enable better decision making on the part of healthcare providers who may be unaware of a patient's prescription drug use, history or other prescriptions. PDMPs have been implemented in state legislations since 1939 in California, a time before electronic medical records, though implementation rose alongside increased awareness of overprescribing of opioids and overdose. A later New York state program was struck down by the U.S. Supreme Court in Whalen v. Roe. But, by 2019, 49 states, the District of Columbia, and Guam had enacted PDMP legislation. In 2021 Missouri, the last State to not use a PMP, adopted legislation to create one. PMPs are constantly being updated to increase speed of data collection, sharing of data across States, and ease of interpretation. This is being done by integrating PDMP reports with other health information technologies such as health information exchanges (HIE), electronic health record (EHR) systems, and/ or pharmacy dispensing software systems. One program that has been implemented in nine states is called the PDMP Electronic Health Records Integration and Interoperability Expansion, also known as PEHRIIE. Another software, marketed by Bamboo Health and integrated with PMPs in 43 states, uses an algorithm to track factors thought to increase risk of diversion, abuse or overdose, and assigns patients a three digit score based on presumed indicators of risk. While some studies have suggested that PDMP-HIT integration and sharing of interstate data brings benefits such as reduced opioid-related inpatient morbidity, others have found no or negative impact on mortality compared to states without PMP data sharing. Patient and media reports suggest need for testing and evaluation of algorithmic software used to score risk, with some patients reporting denial of prescriptions without c explanation or clarity of data. == Goals == Most health care workers support PMPs which intend to assist physicians, physician assistants, nurse practitioners, dentists and other prescribers, the pharmacists, chemists and support staff of dispensing establishments, as well as law-enforcement agencies. The collaboration supports the legitimate medical use of controlled substances while limiting their abuse and diversion. Pharmacies dispensing controlled substances and prescribers typically must register with their respective state PMPs and (for pharmacies and providers who dispense controlled substances from their offices) report the dispensation to an electronic online database. Some pharmacy software can submit these reports automatically to multiple states. == Usage == === List of programs by state === === Software systems === NarxCare is a prescription drug monitoring program (PDMP) run by Bamboo Health. Bamboo Health was formerly known as Appriss. It is widely used across the United States by pharmacies including Rite Aid as well as those at Walmart and Sam’s Club. The NarxCare software allows doctors to view data about a patient, combining data from the prescription registries of various U.S. states to make the registries interoperable nationally. It also uses machine learning to generate an "Overdose Risk Score" that potentially includes EMS and criminal justice data; these scores have been criticized by researchers and patient advocates for the lack of transparency in the process as well as the potential for disparate treatment of women and minority groups. Advertised as an "analytics tool and care management platform", the NarxCare software allows doctors to view data about a patient including how many pharmacies they have visited and the combinations of medication they are prescribed. It combines data from the prescription registries of various U.S. states, making the registries interoperable nationally. It additionally uses machine learning to generate various three-digit "risk scores" and an overall "Overdose Risk Score", collectively referred to as Narx Scores, in a process that potentially includes EMS and criminal justice data as well as court records. == Controversy == Many doctors and researchers support the idea of PDMPs as a tool in combatting the opioid epidemic. Opioid prescribing, opioid diversion and supply, opioid misuse, and opioid-related morbidity and mortality are common elements in data entered into PDMPs. Prescription Monitoring Programs are purported to offer economic benefits for the states who implement them by decreasing overall health care costs, lost productivity, and investigation times. However, there are many studies that conclude the impact of PDMPs is unclear. While use of PMPs has been accompanied by decrease in opioid prescribing, few analyses consider corresponding use of street opioids, extramedical use, or diversion, which might provide a more holistic method for evaluation of PMP intent and efficacy. Evidence for PDMP impact on fatal overdoses is decidedly mixed, with multiple studies finding increased overdose rates in some states, decreases in others, or no clear impact. Interestingly, an increase in heroin overdoses after PDMP implementation has been commonly reported, presumably as denial of prescription opioids sends patients in search of street drugs. Narx Scores have been criticized by researchers and patient advocates for the lack of transparency in the generation process as well as the potential for disparate treatment of women and minority groups. Writing in Duke Law Journal, Jennifer Oliva stated that "black-box algorithms" are used to generate the scores.

    Read more →
  • Autoencoder

    Autoencoder

    An autoencoder is a type of artificial neural network used to learn efficient codings of unlabeled data (unsupervised learning). An autoencoder learns two functions: an encoding function that transforms the input data, and a decoding function that recreates the input data from the encoded representation. The autoencoder learns an efficient representation (encoding) for a set of data, typically for dimensionality reduction, to generate lower-dimensional embeddings for subsequent use by other machine learning algorithms. Variants exist which aim to make the learned representations assume useful properties. Examples are regularized autoencoders (sparse, denoising and contractive autoencoders), which are effective in learning representations for subsequent classification tasks, and variational autoencoders, which can be used as generative models. Autoencoders are applied to many problems, including facial recognition, feature detection, anomaly detection, and learning the meaning of words. In terms of data synthesis, autoencoders can also be used to randomly generate new data that is similar to the input (training) data. == Mathematical principles == === Definition === An autoencoder is defined by the following components: Two sets: the space of encoded messages Z {\displaystyle {\mathcal {Z}}} ; the space of decoded messages X {\displaystyle {\mathcal {X}}} . Typically X {\displaystyle {\mathcal {X}}} and Z {\displaystyle {\mathcal {Z}}} are Euclidean spaces, that is, X = R m , Z = R n {\displaystyle {\mathcal {X}}=\mathbb {R} ^{m},{\mathcal {Z}}=\mathbb {R} ^{n}} with m > n . {\displaystyle m>n.} Two parametrized families of functions: the encoder family E ϕ : X → Z {\displaystyle E_{\phi }:{\mathcal {X}}\rightarrow {\mathcal {Z}}} , parametrized by ϕ {\displaystyle \phi } ; the decoder family D θ : Z → X {\displaystyle D_{\theta }:{\mathcal {Z}}\rightarrow {\mathcal {X}}} , parametrized by θ {\displaystyle \theta } .For any x ∈ X {\displaystyle x\in {\mathcal {X}}} , we usually write z = E ϕ ( x ) {\displaystyle z=E_{\phi }(x)} , and refer to it as the code, the latent variable, latent representation, latent vector, etc. Conversely, for any z ∈ Z {\displaystyle z\in {\mathcal {Z}}} , we usually write x ′ = D θ ( z ) {\displaystyle x'=D_{\theta }(z)} , and refer to it as the (decoded) message. Usually, both the encoder and the decoder are defined as multilayer perceptrons (MLPs). For example, a one-layer-MLP encoder E ϕ {\displaystyle E_{\phi }} is: E ϕ ( x ) = σ ( W x + b ) {\displaystyle E_{\phi }(\mathbf {x} )=\sigma (Wx+b)} where σ {\displaystyle \sigma } is an element-wise activation function, W {\displaystyle W} is a "weight" matrix, and b {\displaystyle b} is a "bias" vector. === Training an autoencoder === An autoencoder, by itself, is simply a tuple of two functions. To judge its quality, we need a task. A task is defined by a reference probability distribution μ r e f {\displaystyle \mu _{ref}} over X {\displaystyle {\mathcal {X}}} , and a "reconstruction quality" function d : X × X → [ 0 , ∞ ] {\displaystyle d:{\mathcal {X}}\times {\mathcal {X}}\to [0,\infty ]} , such that d ( x , x ′ ) {\displaystyle d(x,x')} measures how much x ′ {\displaystyle x'} differs from x {\displaystyle x} . With those, we can define the loss function for the autoencoder as L ( θ , ϕ ) := E x ∼ μ r e f [ d ( x , D θ ( E ϕ ( x ) ) ) ] {\displaystyle L(\theta ,\phi ):=\mathbb {\mathbb {E} } _{x\sim \mu _{ref}}[d(x,D_{\theta }(E_{\phi }(x)))]} The optimal autoencoder for the given task ( μ r e f , d ) {\displaystyle (\mu _{ref},d)} is then arg ⁡ min θ , ϕ L ( θ , ϕ ) {\displaystyle \arg \min _{\theta ,\phi }L(\theta ,\phi )} . The search for the optimal autoencoder can be accomplished by any mathematical optimization technique, but usually by gradient descent. This search process is referred to as "training the autoencoder". In most situations, the reference distribution is just the empirical distribution given by a dataset { x 1 , . . . , x N } ⊂ X {\displaystyle \{x_{1},...,x_{N}\}\subset {\mathcal {X}}} , so that μ r e f = 1 N ∑ i = 1 N δ x i {\displaystyle \mu _{ref}={\frac {1}{N}}\sum _{i=1}^{N}\delta _{x_{i}}} where δ x i {\displaystyle \delta _{x_{i}}} is the Dirac measure, the quality function is just L 2 {\displaystyle L^{2}} loss: d ( x , x ′ ) = ‖ x − x ′ ‖ 2 2 {\displaystyle d(x,x')=\|x-x'\|_{2}^{2}} , and ‖ ⋅ ‖ 2 {\displaystyle \|\cdot \|_{2}} is the Euclidean norm. Then the problem of searching for the optimal autoencoder is just a least-squares optimization: min θ , ϕ L ( θ , ϕ ) , where L ( θ , ϕ ) = 1 N ∑ i = 1 N ‖ x i − D θ ( E ϕ ( x i ) ) ‖ 2 2 {\displaystyle \min _{\theta ,\phi }L(\theta ,\phi ),\qquad {\text{where }}L(\theta ,\phi )={\frac {1}{N}}\sum _{i=1}^{N}\|x_{i}-D_{\theta }(E_{\phi }(x_{i}))\|_{2}^{2}} === Interpretation === An autoencoder has two main parts: an encoder that maps the message to a code, and a decoder that reconstructs the message from the code. An optimal autoencoder would perform as close to perfect reconstruction as possible, with "close to perfect" defined by the reconstruction quality function d {\displaystyle d} . The simplest way to perform the copying task perfectly would be to duplicate the signal. To suppress this behavior, the code space Z {\displaystyle {\mathcal {Z}}} usually has fewer dimensions than the message space X {\displaystyle {\mathcal {X}}} . Such an autoencoder is called undercomplete. It can be interpreted as compressing the message, or reducing its dimensionality. At the limit of an ideal undercomplete autoencoder, every possible code z {\displaystyle z} in the code space is used to encode a message x {\displaystyle x} that really appears in the distribution μ r e f {\displaystyle \mu _{ref}} , and the decoder is also perfect: D θ ( E ϕ ( x ) ) = x {\displaystyle D_{\theta }(E_{\phi }(x))=x} . This ideal autoencoder can then be used to generate messages indistinguishable from real messages, by feeding its decoder arbitrary code z {\displaystyle z} and obtaining D θ ( z ) {\displaystyle D_{\theta }(z)} , which is a message that really appears in the distribution μ r e f {\displaystyle \mu _{ref}} . If the code space Z {\displaystyle {\mathcal {Z}}} has dimension larger than (overcomplete), or equal to, the message space X {\displaystyle {\mathcal {X}}} , or the hidden units are given enough capacity, an autoencoder can learn the identity function and become useless. However, experimental results found that overcomplete autoencoders might still learn useful features. In the ideal setting, the code dimension and the model capacity could be set on the basis of the complexity of the data distribution to be modeled. A standard way to do so is to add modifications to the basic autoencoder, to be detailed below. == Variations == === Variational autoencoder (VAE) === Variational autoencoders (VAEs) belong to the families of variational Bayesian methods. Despite the architectural similarities with basic autoencoders, VAEs are architected with different goals and have a different mathematical formulation. The latent space is, in this case, composed of a mixture of distributions instead of fixed vectors. Given an input dataset x {\displaystyle x} characterized by an unknown probability function P ( x ) {\displaystyle P(x)} and a multivariate latent encoding vector z {\displaystyle z} , the objective is to model the data as a distribution p θ ( x ) {\displaystyle p_{\theta }(x)} , with θ {\displaystyle \theta } defined as the set of the network parameters so that p θ ( x ) = ∫ z p θ ( x , z ) d z {\displaystyle p_{\theta }(x)=\int _{z}p_{\theta }(x,z)dz} . === Sparse autoencoder (SAE) === Inspired by the sparse coding hypothesis in neuroscience, sparse autoencoders (SAE) are variants of autoencoders, such that the codes E ϕ ( x ) {\displaystyle E_{\phi }(x)} for messages tend to be sparse codes, that is, E ϕ ( x ) {\displaystyle E_{\phi }(x)} is close to zero in most entries. Sparse autoencoders may include more (rather than fewer) hidden units than inputs, but only a small number of the hidden units are allowed to be active at the same time. Encouraging sparsity improves performance on classification tasks. There are two main ways to enforce sparsity. One way is to simply clamp all but the highest-k activations of the latent code to zero. This is the k-sparse autoencoder. The k-sparse autoencoder inserts the following "k-sparse function" in the latent layer of a standard autoencoder: f k ( x 1 , . . . , x n ) = ( x 1 b 1 , . . . , x n b n ) {\displaystyle f_{k}(x_{1},...,x_{n})=(x_{1}b_{1},...,x_{n}b_{n})} where b i = 1 {\displaystyle b_{i}=1} if | x i | {\displaystyle |x_{i}|} ranks in the top k, and 0 otherwise. Backpropagating through f k {\displaystyle f_{k}} is simple: set gradient to 0 for b i = 0 {\displaystyle b_{i}=0} entries, and keep gradient for b i = 1 {\displaystyle b_{i}=1} entries. This is essentially a generalized ReLU function. The other way is a relaxed version of the k-

    Read more →
  • Sigmoid function

    Sigmoid function

    A sigmoid function is any mathematical function whose graph has a characteristic S-shaped or sigmoid curve. A common example of a sigmoid function is the logistic function. Other sigmoid functions are given in the Examples section. In some fields, most notably in the context of artificial neural networks, the term "sigmoid function" is used as a synonym for "logistic function". Special cases of sigmoid functions include the Gompertz curve (used in modeling systems that saturate at large values of x) and the ogee curve (used in the spillway of some dams). Sigmoid functions have domain of all real numbers, with return (response) value commonly monotonically increasing but could be decreasing. Sigmoid functions most often show a return value (y axis) in the range 0 to 1. Another commonly used range is from −1 to 1. There is also the Heaviside step function, which instantaneously transitions between 0 and 1. A wide variety of sigmoid functions including the logistic and hyperbolic tangent functions have been used as the activation function of artificial neurons. Sigmoid curves are also common in statistics as cumulative distribution functions (which go from 0 to 1), such as the integrals of the logistic density, the normal density, and Student's t probability density functions. The logistic sigmoid function is invertible, and its inverse is the logit function. == Theory == In mathematics, a unitary sigmoid function is a bounded sigmoid-type function normalized to the unit range, typically with lower and upper asymptotes at 0 and 1. The theory proposed by Grebenc distinguishes three kinds of unitary sigmoid functions according to their asymptotic behavior and the presence or absence of oscillation near the asymptotes. A general form of a unitary sigmoid function is y = A S ( f ( x ) ) + B , {\displaystyle y=A\,S(f(x))+B,} where S {\displaystyle S} is an increasing sigmoid function, f ( x ) {\displaystyle f(x)} is a transformation of the independent variable, and A {\displaystyle A} and B {\displaystyle B} are constants controlling scaling and translation. === Classification === ==== 1st kind ==== A unitary sigmoid function of the first kind is a bounded increasing function that approaches its lower and upper asymptotes monotonically, without oscillation. This class includes many of the standard sigmoid functions used in statistics, biomathematics, and engineering, such as the logistic function and related generalizations. ==== 2nd kind ==== A unitary sigmoid function of the second kind is a bounded increasing function that oscillates near the upper asymptote while preserving an overall sigmoid transition. ==== 3rd kind ==== A unitary sigmoid function of the third kind is a bounded increasing function that oscillates near both the lower and upper asymptotes. These functions retain the global shape of a sigmoid curve but exhibit oscillatory behavior in the vicinity of both limiting states. === Taxonomy === The tables below show the taxonomy of unitary sigmoid functions of all three kinds. Table 1. Taxonomy matrix with examples of sigmoid functions of the 1st kind Table 2. Taxonomy matrix with examples of sigmoid functions of the 2nd kind on the unbounded interval Table 3. Taxonomy matrix with examples of sigmoid functions of the 3rd kind === Construction methods === The same theory presents a list of 30 methods for constructing sigmoid functions.. These include algebraic transformations, integration and convolution methods, constructions from bell-shaped functions, solutions of ordinary and partial differential equations, recursive schemes, stochastic differential equations, feedback systems, and chaotic systems. M0: Construction method for sigmoid functions not evident or intuitive M1: Inverse of singularity functions M2: Sigmoid functions of embedded positive functions M3: Rising a sigmoid function to the power M4: Exponentiating a sigmoid function M5: Symmetric sigmoid functions derived from asymmetric ones M6: Sigmoid functions of the reciprocal independent variable M7: Embedding a sigmoid function into other function M8: Sum of sigmoid functions M9: Multiplication of sigmoid functions M10: Integral of the product of an increasing and a decreasing function M11: Derivation from lambda (bell-shaped) functions M12: Integration of lambda (bell-shaped) function M13: Integration of the sum of lambda (bell-shaped) functions M14: Integration of the product of two lambda (bell-shaped) functions M15: Integration of the difference of two shifted sigmoid functions M16: Integration of the product of two shifted sigmoid functions M17: Convolution of sigmoid functions M18: Integration of the product of lambda and sigmoid function M19: Solutions of ordinary differential equations M20: Solutions of partial differential equation (PDE) M21: Solutions of functional differential equation (FDE) M22: Sum of a sigmoid function and some derivatives M23: Combination of sigmoid functions, its derivative and integral M24: Filtering sigmoid functions M25: Special cases of Gauss hypergeometric functions M26: Feedback closed-loop systems M27: Recursive functions M28: Recursive time-delayed feed-forward loops M29: Solutions of stochastic differential equation M30: Chaotic sigmoid functions Consult reference for more details. == Definition == A sigmoid function is a bounded, differentiable, real function that is defined for all real input values and has a positive derivative at each point. == Properties == In general, a sigmoid function is monotonic, and has a first derivative which is bell shaped. Conversely, the integral of any continuous, non-negative, bell-shaped function (with one local maximum and no local minimum, unless degenerate) will be sigmoidal. Thus the cumulative distribution functions for many common probability distributions are sigmoidal. One such example is the error function, which is related to the cumulative distribution function of a normal distribution; another is the arctan function, which is related to the cumulative distribution function of a Cauchy distribution. A sigmoid function is constrained by a pair of horizontal asymptotes as x → ± ∞ {\displaystyle x\rightarrow \pm \infty } . A sigmoid function is convex for values less than a particular point, and it is concave for values greater than that point: in many of the examples here, that point is 0. == Examples == Logistic function f ( x ) = 1 1 + e − x {\displaystyle f(x)={\frac {1}{1+e^{-x}}}} Hyperbolic tangent (shifted and scaled version of the logistic function, above) f ( x ) = tanh ⁡ x = e x − e − x e x + e − x {\displaystyle f(x)=\tanh x={\frac {e^{x}-e^{-x}}{e^{x}+e^{-x}}}} Arctangent function f ( x ) = arctan ⁡ x {\displaystyle f(x)=\arctan x} Gudermannian function f ( x ) = gd ⁡ ( x ) = ∫ 0 x d t cosh ⁡ t = 2 arctan ⁡ ( tanh ⁡ ( x 2 ) ) {\displaystyle f(x)=\operatorname {gd} (x)=\int _{0}^{x}{\frac {dt}{\cosh t}}=2\arctan \left(\tanh \left({\frac {x}{2}}\right)\right)} Error function f ( x ) = erf ⁡ ( x ) = 2 π ∫ 0 x e − t 2 d t {\displaystyle f(x)=\operatorname {erf} (x)={\frac {2}{\sqrt {\pi }}}\int _{0}^{x}e^{-t^{2}}\,dt} Generalised logistic function f ( x ) = ( 1 + e − x ) − α , α > 0 {\displaystyle f(x)=\left(1+e^{-x}\right)^{-\alpha },\quad \alpha >0} Smoothstep function f ( x ) = { ( ∫ 0 1 ( 1 − u 2 ) N d u ) − 1 ∫ 0 x ( 1 − u 2 ) N d u , | x | ≤ 1 sgn ⁡ ( x ) | x | ≥ 1 N ∈ Z ≥ 1 {\displaystyle f(x)={\begin{cases}{\displaystyle \left(\int _{0}^{1}\left(1-u^{2}\right)^{N}du\right)^{-1}\int _{0}^{x}\left(1-u^{2}\right)^{N}\ du},&|x|\leq 1\\\\\operatorname {sgn}(x)&|x|\geq 1\\\end{cases}}\quad N\in \mathbb {Z} \geq 1} Some algebraic functions, for example f ( x ) = x 1 + x 2 {\displaystyle f(x)={\frac {x}{\sqrt {1+x^{2}}}}} and in a more general form f ( x ) = x ( 1 + | x | k ) 1 / k {\displaystyle f(x)={\frac {x}{\left(1+|x|^{k}\right)^{1/k}}}} Up to shifts and scaling, many sigmoids are special cases of f ( x ) = φ ( φ ( x , β ) , α ) , {\displaystyle f(x)=\varphi (\varphi (x,\beta ),\alpha ),} where φ ( x , λ ) = { ( 1 − λ x ) 1 / λ λ ≠ 0 e − x λ = 0 {\displaystyle \varphi (x,\lambda )={\begin{cases}(1-\lambda x)^{1/\lambda }&\lambda \neq 0\\e^{-x}&\lambda =0\\\end{cases}}} is the inverse of the negative Box–Cox transformation, and α < 1 {\displaystyle \alpha <1} and β < 1 {\displaystyle \beta <1} are shape parameters. Smooth transition function normalized to (−1,1): f ( x ) = { 2 1 + e − 2 m x 1 − x 2 − 1 , | x | < 1 sgn ⁡ ( x ) | x | ≥ 1 = { tanh ⁡ ( m x 1 − x 2 ) , | x | < 1 sgn ⁡ ( x ) | x | ≥ 1 {\displaystyle {\begin{aligned}f(x)&={\begin{cases}{\displaystyle {\frac {2}{1+e^{-2m{\frac {x}{1-x^{2}}}}}}-1},&|x|<1\\\\\operatorname {sgn}(x)&|x|\geq 1\\\end{cases}}\\&={\begin{cases}{\displaystyle \tanh \left(m{\frac {x}{1-x^{2}}}\right)},&|x|<1\\\\\operatorname {sgn}(x)&|x|\geq 1\\\end{cases}}\end{aligned}}} using the hyperbolic tangent mentioned above. Here, m {\displaystyle m} is a free parameter encoding the slope at x = 0 {\displaystyle x=0} , which must be great

    Read more →
  • Single particle analysis

    Single particle analysis

    Single particle analysis is a group of related computerized image processing techniques used to analyze images from transmission electron microscopy (TEM). These methods were developed to improve and extend the information obtainable from TEM images of particulate samples, typically proteins or other large biological entities such as viruses. Individual images of stained or unstained particles are very noisy, making interpretation difficult. Combining several digitized images of similar particles together gives an image with stronger and more easily interpretable features. An extension of this technique uses single particle methods to build up a three-dimensional reconstruction of the particle. Using cryo-electron microscopy it has become possible to generate reconstructions with sub-nanometer, near-atomic resolution resolution first in the case of highly symmetric viruses, and now in smaller, asymmetric proteins as well. == Techniques == Single particle analysis can be done on both negatively stained and vitreous ice-embedded transmission electron cryomicroscopy (CryoTEM) samples. Single particle analysis methods are, in general, reliant on the sample being homogeneous, although techniques for dealing with conformational heterogeneity are being developed. Images (micrographs) are taken with an electron microscope using charged-coupled device (CCD) detectors coupled to a phosphorescent layer (in the past, they were instead collected on film and digitized using high-quality scanners). The image processing is carried out using specialized software programs, often run on multi-processor computer clusters. Depending on the sample or the desired results, various steps of two- or three-dimensional processing can be done. === Alignment and classification === Biological samples, and especially samples embedded in thin vitreous ice, are highly radiation sensitive, thus only low electron doses can be used to image the sample. This low dose, as well as variations in the metal stain used (if used) means images have high noise relative to the signal given by the particle being observed. By aligning several similar images to each other so they are in register and then averaging them, an image with higher signal-to-noise ratio can be obtained. As the noise is mostly randomly distributed and the underlying image features constant, by averaging the intensity of each pixel over several images only the constant features are reinforced. Typically, the optimal alignment (a translation and an in-plane rotation) to map one image onto another is calculated by cross-correlation. However, a micrograph often contains particles in multiple different orientations and/or conformations, and so to get more representative image averages, a method is required to group similar particle images together into multiple sets. This is normally carried out using one of several data analysis and image classification algorithms, such as multi-variate statistical analysis and hierarchical ascendant classification, or k-means clustering. Often data sets of tens of thousands of particle images are used, and to reach an optimal solution an iterative procedure of alignment and classification is used, whereby strong image averages produced by classification are used as reference images for a subsequent alignment of the whole data set. === Image filtering === Image filtering (band-pass filtering) is often used to reduce the influence of high and/or low spatial frequency information in the images, which can affect the results of the alignment and classification procedures. This is particularly useful in negative stain images. The algorithms make use of fast Fourier transforms (FFT), often employing Gaussian shaped soft-edged masks in reciprocal space to suppress certain frequency ranges. High-pass filters remove low spatial frequencies (such as ramp or gradient effects), leaving the higher frequencies intact. Low-pass filters remove high spatial frequency features and have a blurring effect on fine details. === Contrast transfer function === Due to the nature of image formation in the electron microscope, bright-field TEM images are obtained using significant underfocus. This, along with features inherent in the microscope's lens system, creates blurring of the collected images visible as a point spread function. The combined effects of the imaging conditions are known as the contrast transfer function (CTF), and can be approximated mathematically as a function in reciprocal space. Specialized image processing techniques such as phase flipping and amplitude correction / Wiener filtering can (at least partially) correct for the CTF, and allow high resolution reconstructions. === Three-dimensional reconstruction === Transmission electron microscopy images are projections of the object showing the distribution of density through the object, similar to medical X-rays. By making use of the projection-slice theorem a three-dimensional reconstruction of the object can be generated by combining many images (2D projections) of the object taken from a range of viewing angles. Proteins in vitreous ice ideally adopt a random distribution of orientations (or viewing angles), allowing a fairly isotropic reconstruction if a large number of particle images are used. This contrasts with electron tomography, where the viewing angles are limited due to the geometry of the sample/imaging set up, giving an anisotropic reconstruction. Filtered back projection is a commonly used method of generating 3D reconstructions in single particle analysis, although many alternative algorithms exist. Before a reconstruction can be made, the orientation of the object in each image needs to be estimated. Several methods have been developed to work out the relative Euler angles of each image. Some are based on common lines (common 1D projections and sinograms), others use iterative projection matching algorithms. The latter works by beginning with a simple, low resolution 3D starting model and compares the experimental images to projections of the model and creates a new 3D to bootstrap towards a solution. Methods are also available for making 3D reconstructions of helical samples (such as tobacco mosaic virus), taking advantage of the inherent helical symmetry. Both real space methods (treating sections of the helix as single particles) and reciprocal space methods (using diffraction patterns) can be used for these samples. === Tilt methods === The specimen stage of the microscope can be tilted (typically along a single axis), allowing the single particle technique known as random conical tilt. An area of the specimen is imaged at both zero and at high angle (~60-70 degrees) tilts, or in the case of the related method of orthogonal tilt reconstruction, +45 and −45 degrees. Pairs of particles corresponding to the same object at two different tilts (tilt pairs) are selected, and by following the parameters used in subsequent alignment and classification steps a three-dimensional reconstruction can be generated relatively easily. This is because the viewing angle (defined as three Euler angles) of each particle is known from the tilt geometry. 3D reconstructions from random conical tilt suffer from missing information resulting from a restricted range of orientations. Known as the missing cone (due to the shape in reciprocal space), this causes distortions in the 3D maps. However, the missing cone problem can often be overcome by combining several tilt reconstructions. Tilt methods are best suited to negatively stained samples, and can be used for particles that adsorb to the carbon support film in preferred orientations. The phenomenon known as charging or beam-induced movement makes collecting high-tilt images of samples in vitreous ice challenging. === Map visualization and fitting === Various software programs are available that allow viewing the 3D maps. These often enable the user to manually dock in protein coordinates (structures from X-ray crystallography, NMR, or a computational model such as one found in the AlphaFold Protein Structure Database) of subunits into the electron density. Several programs can also fit subunits computationally; as of the 2020s using these programs tend to produce better accuracy than manual docking because they can perform labor-intensive tasks such as: The scale of SPA-derived maps depends on knowing the pixel size (angstorms per pixel), which is not always accurate. Programs can automatically correct for this difference by using coordinate data or by using knowledge of chemical bonds. Many proteins are made up of several roughly rigid protein domains linked by flexible parts. Pre-existing coordinate data, whether experimental or computational, may not exactly match the inter-domain positioning of the cyro-EM map. Modern programs can automatically "chop" pre-existing coordinate data into individual domains and fit them in individually. For higher-resolution structures, it is pos

    Read more →
  • LPBoost

    LPBoost

    Linear Programming Boosting (LPBoost) is a supervised classifier from the boosting family of classifiers. LPBoost maximizes a margin between training samples of different classes, and thus also belongs to the class of margin classifier algorithms. Consider a classification function f : X → { − 1 , 1 } , {\displaystyle f:{\mathcal {X}}\to \{-1,1\},} which classifies samples from a space X {\displaystyle {\mathcal {X}}} into one of two classes, labelled 1 and -1, respectively. LPBoost is an algorithm for learning such a classification function, given a set of training examples with known class labels. LPBoost is a machine learning technique especially suited for joint classification and feature selection in structured domains. == LPBoost overview == As in all boosting classifiers, the final classification function is of the form f ( x ) = ∑ j = 1 J α j h j ( x ) , {\displaystyle f({\boldsymbol {x}})=\sum _{j=1}^{J}\alpha _{j}h_{j}({\boldsymbol {x}}),} where α j {\displaystyle \alpha _{j}} are non-negative weightings for weak classifiers h j : X → { − 1 , 1 } {\displaystyle h_{j}:{\mathcal {X}}\to \{-1,1\}} . Each individual weak classifier h j {\displaystyle h_{j}} may be just a little bit better than random, but the resulting linear combination of many weak classifiers can perform very well. LPBoost constructs f {\displaystyle f} by starting with an empty set of weak classifiers. Iteratively, a single weak classifier to add to the set of considered weak classifiers is selected, added and all the weights α {\displaystyle {\boldsymbol {\alpha }}} for the current set of weak classifiers are adjusted. This is repeated until no weak classifiers to add remain. The property that all classifier weights are adjusted in each iteration is known as totally-corrective property. Early boosting methods, such as AdaBoost do not have this property and converge slower. == Linear program == More generally, let H = { h ( ⋅ ; ω ) | ω ∈ Ω } {\displaystyle {\mathcal {H}}=\{h(\cdot ;\omega )|\omega \in \Omega \}} be the possibly infinite set of weak classifiers, also termed hypotheses. One way to write down the problem LPBoost solves is as a linear program with infinitely many variables. The primal linear program of LPBoost, optimizing over the non-negative weight vector α {\displaystyle {\boldsymbol {\alpha }}} , the non-negative vector ξ {\displaystyle {\boldsymbol {\xi }}} of slack variables and the margin ρ {\displaystyle \rho } is the following. min α , ξ , ρ − ρ + D ∑ n = 1 ℓ ξ n sb.t. ∑ ω ∈ Ω y n α ω h ( x n ; ω ) + ξ n ≥ ρ , n = 1 , … , ℓ , ∑ ω ∈ Ω α ω = 1 , ξ n ≥ 0 , n = 1 , … , ℓ , α ω ≥ 0 , ω ∈ Ω , ρ ∈ R . {\displaystyle {\begin{array}{cl}{\underset {{\boldsymbol {\alpha }},{\boldsymbol {\xi }},\rho }{\min }}&-\rho +D\sum _{n=1}^{\ell }\xi _{n}\\{\textrm {sb.t.}}&\sum _{\omega \in \Omega }y_{n}\alpha _{\omega }h({\boldsymbol {x}}_{n};\omega )+\xi _{n}\geq \rho ,\qquad n=1,\dots ,\ell ,\\&\sum _{\omega \in \Omega }\alpha _{\omega }=1,\\&\xi _{n}\geq 0,\qquad n=1,\dots ,\ell ,\\&\alpha _{\omega }\geq 0,\qquad \omega \in \Omega ,\\&\rho \in {\mathbb {R} }.\end{array}}} Note the effects of slack variables ξ ≥ 0 {\displaystyle {\boldsymbol {\xi }}\geq 0} : their one-norm is penalized in the objective function by a constant factor D {\displaystyle D} , which—if small enough—always leads to a primal feasible linear program. Here we adopted the notation of a parameter space Ω {\displaystyle \Omega } , such that for a choice ω ∈ Ω {\displaystyle \omega \in \Omega } the weak classifier h ( ⋅ ; ω ) : X → { − 1 , 1 } {\displaystyle h(\cdot ;\omega ):{\mathcal {X}}\to \{-1,1\}} is uniquely defined. When the above linear program was first written down in early publications about boosting methods it was disregarded as intractable due to the large number of variables α {\displaystyle {\boldsymbol {\alpha }}} . Only later it was discovered that such linear programs can indeed be solved efficiently using the classic technique of column generation. === Column generation for LPBoost === In a linear program a column corresponds to a primal variable. Column generation is a technique to solve large linear programs. It typically works in a restricted problem, dealing only with a subset of variables. By generating primal variables iteratively and on-demand, eventually the original unrestricted problem with all variables is recovered. By cleverly choosing the columns to generate the problem can be solved such that while still guaranteeing the obtained solution to be optimal for the original full problem, only a small fraction of columns has to be created. ==== LPBoost dual problem ==== Columns in the primal linear program corresponds to rows in the dual linear program. The equivalent dual linear program of LPBoost is the following linear program. max λ , γ γ sb.t. ∑ n = 1 ℓ y n h ( x n ; ω ) λ n + γ ≤ 0 , ω ∈ Ω , 0 ≤ λ n ≤ D , n = 1 , … , ℓ , ∑ n = 1 ℓ λ n = 1 , γ ∈ R . {\displaystyle {\begin{array}{cl}{\underset {{\boldsymbol {\lambda }},\gamma }{\max }}&\gamma \\{\textrm {sb.t.}}&\sum _{n=1}^{\ell }y_{n}h({\boldsymbol {x}}_{n};\omega )\lambda _{n}+\gamma \leq 0,\qquad \omega \in \Omega ,\\&0\leq \lambda _{n}\leq D,\qquad n=1,\dots ,\ell ,\\&\sum _{n=1}^{\ell }\lambda _{n}=1,\\&\gamma \in \mathbb {R} .\end{array}}} For linear programs the optimal value of the primal and dual problem are equal. For the above primal and dual problems, the optimal value is equal to the negative 'soft margin'. The soft margin is the size of the margin separating positive from negative training instances minus positive slack variables that carry penalties for margin-violating samples. Thus, the soft margin may be positive although not all samples are linearly separated by the classification function. The latter is called the 'hard margin' or 'realized margin'. ==== Convergence criterion ==== Consider a subset of the satisfied constraints in the dual problem. For any finite subset we can solve the linear program and thus satisfy all constraints. If we could prove that of all the constraints which we did not add to the dual problem no single constraint is violated, we would have proven that solving our restricted problem is equivalent to solving the original problem. More formally, let γ ∗ {\displaystyle \gamma ^{}} be the optimal objective function value for any restricted instance. Then, we can formulate a search problem for the 'most violated constraint' in the original problem space, namely finding ω ∗ ∈ Ω {\displaystyle \omega ^{}\in \Omega } as ω ∗ = argmax ω ∈ Ω ∑ n = 1 ℓ y n h ( x n ; ω ) λ n . {\displaystyle \omega ^{}={\underset {\omega \in \Omega }{\textrm {argmax}}}\sum _{n=1}^{\ell }y_{n}h({\boldsymbol {x}}_{n};\omega )\lambda _{n}.} That is, we search the space H {\displaystyle {\mathcal {H}}} for a single decision stump h ( ⋅ ; ω ∗ ) {\displaystyle h(\cdot ;\omega ^{})} maximizing the left hand side of the dual constraint. If the constraint cannot be violated by any choice of decision stump, none of the corresponding constraint can be active in the original problem and the restricted problem is equivalent. ==== Penalization constant ==== D {\displaystyle D} The positive value of penalization constant D {\displaystyle D} has to be found using model selection techniques. However, if we choose D = 1 ℓ ν {\displaystyle D={\frac {1}{\ell \nu }}} , where ℓ {\displaystyle \ell } is the number of training samples and 0 < ν < 1 {\displaystyle 0<\nu <1} , then the new parameter ν {\displaystyle \nu } has the following properties. ν {\displaystyle \nu } is an upper bound on the fraction of training errors; that is, if k {\displaystyle k} denotes the number of misclassified training samples, then k ℓ ≤ ν {\displaystyle {\frac {k}{\ell }}\leq \nu } . ν {\displaystyle \nu } is a lower bound on the fraction of training samples outside or on the margin. == Algorithm == Input: Training set X = { x 1 , … , x ℓ } {\displaystyle X=\{{\boldsymbol {x}}_{1},\dots ,{\boldsymbol {x}}_{\ell }\}} , x i ∈ X {\displaystyle {\boldsymbol {x}}_{i}\in {\mathcal {X}}} Training labels Y = { y 1 , … , y ℓ } {\displaystyle Y=\{y_{1},\dots ,y_{\ell }\}} , y i ∈ { − 1 , 1 } {\displaystyle y_{i}\in \{-1,1\}} Convergence threshold θ ≥ 0 {\displaystyle \theta \geq 0} Output: Classification function f : X → { − 1 , 1 } {\displaystyle f:{\mathcal {X}}\to \{-1,1\}} Initialization Weights, uniform λ n ← 1 ℓ , n = 1 , … , ℓ {\displaystyle \lambda _{n}\leftarrow {\frac {1}{\ell }},\quad n=1,\dots ,\ell } Edge γ ← 0 {\displaystyle \gamma \leftarrow 0} Hypothesis count J ← 1 {\displaystyle J\leftarrow 1} Iterate h ^ ← argmax ω ∈ Ω ∑ n = 1 ℓ y n h ( x n ; ω ) λ n {\displaystyle {\hat {h}}\leftarrow {\underset {\omega \in \Omega }{\textrm {argmax}}}\sum _{n=1}^{\ell }y_{n}h({\boldsymbol {x}}_{n};\omega )\lambda _{n}} if ∑ n = 1 ℓ y n h ^ ( x n ) λ n + γ ≤ θ {\displaystyle \sum _{n=1}^{\ell }y_{n}{\hat {h}}({\boldsymbol {x}}_{n})\lambda _{n}+\gamma \leq \theta } then break h J ← h ^ {\displaystyle h_{J}\leftarrow {\hat {h}}} J

    Read more →
  • Quickprop

    Quickprop

    Quickprop is an iterative method for determining the minimum of the loss function of an artificial neural network, following an algorithm inspired by the Newton's method. Sometimes, the algorithm is classified to the group of the second order learning methods. It follows a quadratic approximation of the previous gradient step and the current gradient, which is expected to be close to the minimum of the loss function, under the assumption that the loss function is locally approximately square, trying to describe it by means of an upwardly open parabola. The minimum is sought in the vertex of the parabola. The procedure requires only local information of the artificial neuron to which it is applied. The k {\displaystyle k} -th approximation step is given by: Δ ( k ) w i j = Δ ( k − 1 ) w i j ( ∇ i j E ( k ) ∇ i j E ( k − 1 ) − ∇ i j E ( k ) ) {\displaystyle \Delta ^{(k)}\,w_{ij}=\Delta ^{(k-1)}\,w_{ij}\left({\frac {\nabla _{ij}\,E^{(k)}}{\nabla _{ij}\,E^{(k-1)}-\nabla _{ij}\,E^{(k)}}}\right)} Where w i j {\displaystyle w_{ij}} is the weight of input i {\displaystyle i} of neuron j {\displaystyle j} , and E {\displaystyle E} is the loss function. The Quickprop algorithm is an implementation of the error backpropagation algorithm, but the network can behave chaotically during the learning phase due to large step sizes.

    Read more →
  • Population model (evolutionary algorithm)

    Population model (evolutionary algorithm)

    The population model of an evolutionary algorithm (EA) describes the structural properties of its population to which its members are subject. A population is the set of all proposed solutions of an EA considered in one iteration, which are also called individuals according to the biological role model. The individuals of a population can generate further individuals as offspring with the help of the genetic operators of the procedure. The simplest and widely used population model in EAs is the global or panmictic model, which corresponds to an unstructured population. It allows each individual to choose any other individual of the population as a partner for the production of offspring by crossover, whereby the details of the selection are irrelevant as long as the fitness of the individuals plays a significant role. Due to global mate selection, the genetic information of even slightly better individuals can prevail in a population after a few generations (iteration of an EA), provided that no better other offspring have emerged in this phase. If the solution found in this way is not the optimum sought, that is called premature convergence. This effect can be observed more often in panmictic populations. In nature global mating pools are rarely found. What prevails is a certain and limited isolation due to spatial distance. The resulting local neighbourhoods initially evolve independently and mutants have a higher chance of persisting over several generations. As a result, genotypic diversity in the gene pool is preserved longer than in a panmictic population. It is therefore obvious to divide the previously global population by substructures. Two basic models were introduced for this purpose, the island models, which are based on a division of the population into fixed subpopulations that exchange individuals from time to time, and the neighbourhood models, which assign individuals to overlapping neighbourhoods, also known as cellular genetic or evolutionary algorithms (cGA or cEA). The associated division of the population also suggests a corresponding parallelization of the procedure. For this reason, the topic of population models is also frequently discussed in the literature in connection with the parallelization of EAs. == Island models == In the island model, also called the migration model or coarse grained model, evolution takes place in strictly divided subpopulations. These can be organised panmictically, but do not have to be. From time to time an exchange of individuals takes place, which is called migration. The time between an exchange is called an epoch and its end can be triggered by various criteria: E.g. after a given time or given number of completed generations, or after the occurrence of stagnation. Stagnation can be detected, for example, by the fact that no fitness improvement has occurred in the island for a given number of generations. Island models introduce a variety of new strategy parameters: Number of subpopulations Size of the subpopulations Neighbourhood relations between islands: they determine which islands are considered neighbouring and can thus exchange individuals, see picture of a simple unidirectional ring (black arrows) and its extension by additional bidirectional neighbourhood relations (additional green arrows) Criteria for the termination of an epoch, synchronous or asynchronous migration Migration rate: number or proportion of individuals involved in migration. Migrant selection: There are many alternatives for this. E.g. the best individuals can replace the worst or randomly selected ones. Depending on the migration rate, this can affect one or more individuals at a time. With these parameters, the selection pressure can be influenced to a considerable extent. For example, it increases with the interconnectedness of the islands and decreases with the number of subpopulations or the epoch length. == Neighbourhood models or cellular evolutionary algorithms == The neighbourhood model, also called diffusion model or fine grained model, defines a topological neighbouhood relation between the individuals of a population that is independent of their phenotypic properties. The fundamental idea of this model is to provide the EA population with a special structure defined as a connected graph, in which each vertex is an individual that communicates with its nearest neighbours. Particularly, individuals are conceptually set in a toroidal mesh, and are only allowed to recombine with close individuals. This leads to a kind of locality known as isolation by distance. The set of potential mates of an individual is called its neighbourhood or deme. The adjacent figure illustrates that by showing two slightly overlapping neighbourhoods of two individuals marked yellow, through which genetic information can spread between the two demes. It is known that in this kind of algorithm, similar individuals tend to cluster and create niches that are independent of the deme boundaries and, in particular, can be larger than a deme. There is no clear borderline between adjacent groups, and close niches could be easily colonized by competitive ones and maybe merge solution contents during this process. Simultaneously, farther niches can be affected more slowly. EAs with this type of population are also well known as cellular EAs (cEA) or cellular genetic algorithms (cGA). A commonly used structure for arranging the individuals of a population is a 2D toroidal grid, although the number of dimensions can be easily extended (to 3D) or reduced (to 1D, e.g. a ring, see the figure on the right). The neighbourhood of a particular individual in the grid is defined in terms of the Manhattan distance from it to others in the population. In the basic algorithm, all the neighbourhoods have the same size and identical shapes. The two most commonly used neighbourhoods for two-dimensional cEAs are L5 and C9, see the figure on the left. Here, L stands for Linear while C stands for Compact. Each deme represents a panmictic subpopulation within which mate selection and the acceptance of offspring takes place by replacing the parent. The rules for the acceptance of offspring are local in nature and based on the neighbourhood: for example, it can be specified that the best offspring must be better than the parent being replaced or, less strictly, only better than the worst individual in the deme. The first rule is elitist and creates a higher selective pressure than the second non-elitist rule. In elitist EAs, the best individual of a population always survives. In this respect, they deviate from the biological model. The overlap of the neighbourhoods causes a mostly slow spread of genetic information across the neighbourhood boundaries, hence the name diffusion model. A better offspring now needs more generations than in panmixy to spread in the population. This promotes the emergence of local niches and their local evolution, thus preserving genotypic diversity over a longer period of time. The result is a better and dynamic balance between breadth and depth search adapted to the search space during a run. Depth search takes place in the niches and breadth search in the niche boundaries and through the evolution of the different niches of the whole population. For the same neighbourhood size, the spread of genetic information is larger for elongated figures like L9 than for a block like C9, and again significantly larger than for a ring. This means that ring neighbourhoods are well suited for achieving high quality results, even if this requires comparatively long run times. On the other hand, if one is primarily interested in fast and good, but possibly suboptimal results, 2D topologies are more suitable. == Comparison == When applying both population models to genetic algorithms, evolutionary strategy and other EAs, the splitting of a total population into subpopulations usually reduces the risk of premature convergence and leads to better results overall more reliably and faster than would be expected with panmictic EAs. Island models have the disadvantage compared to neighbourhood models that they introduce a large number of new strategy parameters. Despite the existing studies on this topic in the literature, a certain risk of unfavourable settings remains for the user. With neighbourhood models, on the other hand, only the size of the neighbourhood has to be specified and, in the case of the two-dimensional model, the choice of the neighbourhood figure is added. == Parallelism == Since both population models imply population partitioning, they are well suited as a basis for parallelizing an EA. This applies even more to cellular EAs, since they rely only on locally available information about the members of their respective demes. Thus, in the extreme case, an independent execution thread can be assigned to each individual, so that the entire cEA can run on a parallel hardware platform. The island model also supports p

    Read more →
  • Decorrelation

    Decorrelation

    Decorrelation is a general term for any process that is used to reduce autocorrelation within a signal, or cross-correlation within a set of signals, while preserving other aspects of the signal. A frequently used method of decorrelation is the use of a matched linear filter to reduce the autocorrelation of a signal as far as possible. Since the minimum possible autocorrelation for a given signal energy is achieved by equalising the power spectrum of the signal to be similar to that of a white noise signal, this is often referred to as signal whitening. == Process == === Signal processing === Most decorrelation algorithms are linear, but there are also non-linear decorrelation algorithms. Many data compression algorithms incorporate a decorrelation stage. For example, many transform coders first apply a fixed linear transformation that would, on average, have the effect of decorrelating a typical signal of the class to be coded, prior to any later processing. This is typically a Karhunen–Loève transform, or a simplified approximation such as the discrete cosine transform. By comparison, sub-band coders do not generally have an explicit decorrelation step, but instead exploit the already-existing reduced correlation within each of the sub-bands of the signal, due to the relative flatness of each sub-band of the power spectrum in many classes of signals. Linear predictive coders can be modelled as an attempt to decorrelate signals by subtracting the best possible linear prediction from the input signal, leaving a whitened residual signal. Decorrelation techniques can also be used for many other purposes, such as reducing crosstalk in a multi-channel signal, or in the design of echo cancellers. In image processing decorrelation techniques can be used to enhance or stretch, colour differences found in each pixel of an image. This is generally termed as 'decorrelation stretching'. === Neuroscience === In neuroscience, decorrelation is used in the analysis of the neural networks in the human visual system. The raw inputs from cone cells and rod cells under go many steps of processing before it is handled by the visual cortex. These steps generally perform decorrelation, both spatial (surround suppression in the retina) and temporal (handling of movement in the lateral geniculate nucleus). === Cryptography === In cryptography, decorrelation is used in cipher design (see Decorrelation theory) and in the design of hardware random number generators.

    Read more →
  • Neural cryptography

    Neural cryptography

    Neural cryptography is a branch of cryptography dedicated to analyzing the application of stochastic algorithms, especially artificial neural network algorithms, for use in encryption and cryptanalysis. == Definition == Artificial neural networks are well known for their ability to selectively explore the solution space of a given problem. This feature finds a natural niche of application in the field of cryptanalysis. At the same time, neural networks offer a new approach to attack ciphering algorithms based on the principle that any function could be reproduced by a neural network, which is a powerful proven computational tool that can be used to find the inverse-function of any cryptographic algorithm. The ideas of mutual learning, self learning, and stochastic behavior of neural networks and similar algorithms can be used for different aspects of cryptography, like public-key cryptography, solving the key distribution problem using neural network mutual synchronization, hashing or generation of pseudo-random numbers. Another idea is the ability of a neural network to separate space in non-linear pieces using "bias". It gives different probabilities of activating the neural network or not. This is very useful in the case of Cryptanalysis. Two names are used to design the same domain of research: Neuro-Cryptography and Neural Cryptography. The first work that it is known on this topic can be traced back to 1995 in an IT Master Thesis. == Applications == In 1995, Sebastien Dourlens applied neural networks to cryptanalyze DES by allowing the networks to learn how to invert the S-tables of the DES. The bias in DES studied through Differential Cryptanalysis by Adi Shamir is highlighted. The experiment shows about 50% of the key bits can be found, allowing the complete key to be found in a short time. Hardware application with multi micro-controllers have been proposed due to the easy implementation of multilayer neural networks in hardware. One example of a public-key protocol is given by Khalil Shihab . He describes the decryption scheme and the public key creation that are based on a backpropagation neural network. The encryption scheme and the private key creation process are based on Boolean algebra. This technique has the advantage of small time and memory complexities. A disadvantage is the property of backpropagation algorithms: because of huge training sets, the learning phase of a neural network is very long. Therefore, the use of this protocol is only theoretical so far. == Neural key exchange protocol == The most used protocol for key exchange between two parties A and B in the practice is Diffie–Hellman key exchange protocol. Neural key exchange, which is based on the synchronization of two tree parity machines, should be a secure replacement for this method. Synchronizing these two machines is similar to synchronizing two chaotic oscillators in chaos communications. === Tree parity machine === The tree parity machine is a special type of multi-layer feedforward neural network. It consists of one output neuron, K hidden neurons and K×N input neurons. Inputs to the network take three values: x i j ∈ { − 1 , 0 , + 1 } {\displaystyle x_{ij}\in \left\{-1,0,+1\right\}} The weights between input and hidden neurons take the values: w i j ∈ { − L , . . . , 0 , . . . , + L } {\displaystyle w_{ij}\in \left\{-L,...,0,...,+L\right\}} Output value of each hidden neuron is calculated as a sum of all multiplications of input neurons and these weights: σ i = sgn ⁡ ( ∑ j = 1 N w i j x i j ) {\displaystyle \sigma _{i}=\operatorname {sgn}(\sum _{j=1}^{N}w_{ij}x_{ij})} Signum is a simple function, which returns −1,0 or 1: sgn ⁡ ( x ) = { − 1 if x < 0 , 0 if x = 0 , 1 if x > 0. {\displaystyle \operatorname {sgn}(x)={\begin{cases}-1&{\text{if }}x<0,\\0&{\text{if }}x=0,\\1&{\text{if }}x>0.\end{cases}}} If the scalar product is 0, the output of the hidden neuron is mapped to −1 in order to ensure a binary output value. The output of neural network is then computed as the multiplication of all values produced by hidden elements: τ = ∏ i = 1 K σ i {\displaystyle \tau =\prod _{i=1}^{K}\sigma _{i}} Output of the tree parity machine is binary. === Protocol === Each party (A and B) uses its own tree parity machine. Synchronization of the tree parity machines is achieved in these steps Initialize random weight values Execute these steps until the full synchronization is achieved Generate random input vector X Compute the values of the hidden neurons Compute the value of the output neuron Compare the values of both tree parity machines Outputs are the same: one of the suitable learning rules is applied to the weights Outputs are different: go to 2.1 After the full synchronization is achieved (the weights wij of both tree parity machines are same), A and B can use their weights as keys. This method is known as a bidirectional learning. One of the following learning rules can be used for the synchronization: Hebbian learning rule: w i + = g ( w i + σ i x i Θ ( σ i τ ) Θ ( τ A τ B ) ) {\displaystyle w_{i}^{+}=g(w_{i}+\sigma _{i}x_{i}\Theta (\sigma _{i}\tau )\Theta (\tau ^{A}\tau ^{B}))} Anti-Hebbian learning rule: w i + = g ( w i − σ i x i Θ ( σ i τ ) Θ ( τ A τ B ) ) {\displaystyle w_{i}^{+}=g(w_{i}-\sigma _{i}x_{i}\Theta (\sigma _{i}\tau )\Theta (\tau ^{A}\tau ^{B}))} Random walk: w i + = g ( w i + x i Θ ( σ i τ ) Θ ( τ A τ B ) ) {\displaystyle w_{i}^{+}=g(w_{i}+x_{i}\Theta (\sigma _{i}\tau )\Theta (\tau ^{A}\tau ^{B}))} Where: Θ ( a , b ) = 0 {\displaystyle \Theta (a,b)=0} if a ≠ b {\displaystyle a\neq b} otherwise Θ ( a , b ) = 1 {\displaystyle \Theta (a,b)=1} And: g ( x ) {\displaystyle g(x)} is a function that keeps the w i {\displaystyle w_{i}} in the range { − L , − L + 1 , . . . , 0 , . . . , L − 1 , L } {\displaystyle \{-L,-L+1,...,0,...,L-1,L\}} === Attacks and security of this protocol === In every attack it is considered, that the attacker E can eavesdrop messages between the parties A and B, but does not have an opportunity to change them. ==== Brute force ==== To provide a brute force attack, an attacker has to test all possible keys (all possible values of weights wij). By K hidden neurons, K×N input neurons and boundary of weights L, this gives (2L+1)KN possibilities. For example, the configuration K = 3, L = 3 and N = 100 gives us 310253 key possibilities, making the attack impossible with today's computer power. ==== Learning with own tree parity machine ==== One of the basic attacks can be provided by an attacker, who owns the same tree parity machine as the parties A and B. He wants to synchronize his tree parity machine with these two parties. In each step there are three situations possible: Output(A) ≠ Output(B): None of the parties updates its weights. Output(A) = Output(B) = Output(E): All the three parties update weights in their tree parity machines. Output(A) = Output(B) ≠ Output(E): Parties A and B update their tree parity machines, but the attacker can not do that. Because of this situation his learning is slower than the synchronization of parties A and B. It has been proven, that the synchronization of two parties is faster than learning of an attacker. It can be improved by increasing of the synaptic depth L of the neural network. That gives this protocol enough security and an attacker can find out the key only with small probability. ==== Other attacks ==== For conventional cryptographic systems, we can improve the security of the protocol by increasing of the key length. In the case of neural cryptography, we improve it by increasing of the synaptic depth L of the neural networks. Changing this parameter increases the cost of a successful attack exponentially, while the effort for the users grows polynomially. Therefore, breaking the security of neural key exchange belongs to the complexity class NP. Alexander Klimov, Anton Mityaguine, and Adi Shamir say that the original neural synchronization scheme can be broken by at least three different attacks—geometric, probabilistic analysis, and using genetic algorithms. Even though this particular implementation is insecure, the ideas behind chaotic synchronization could potentially lead to a secure implementation. === Permutation parity machine === The permutation parity machine is a binary variant of the tree parity machine. It consists of one input layer, one hidden layer and one output layer. The number of neurons in the output layer depends on the number of hidden units K. Each hidden neuron has N binary input neurons: x i j ∈ { 0 , 1 } {\displaystyle x_{ij}\in \left\{0,1\right\}} The weights between input and hidden neurons are also binary: w i j ∈ { 0 , 1 } {\displaystyle w_{ij}\in \left\{0,1\right\}} Output value of each hidden neuron is calculated as a sum of all exclusive disjunctions (exclusive or) of input neurons and these weights: σ i = θ N ( ∑ j = 1 N w i j ⊕ x i j ) {\displaystyle \sigma _{i}=\theta _{N}(\sum _{j=1}^{N}w_{ij}\oplus x_{ij})} (⊕ means XOR). Th

    Read more →
  • Random neural network

    Random neural network

    The Random Neural Network (RNN) is a mathematical representation of an interconnected network of neurons or cells which exchange spiking signals. It was invented by Erol Gelenbe and is linked to the G-network model of queueing networks which Erol Gelenbe also invented, and with his Gene Regulatory Network models. In this model, each neuronal cell state is represented by an integer whose value rises when the cell receives an excitatory spike and drops when it receives an inhibitory spike. The spikes can originate outside the network itself, or they can come from other cells in the networks. Cells whose internal excitatory state has a positive value are allowed to send out spikes of either kind to other cells in the network according to specific cell-dependent spiking rates. The model has a mathematical solution in steady-state which provides the joint probability distribution of the network in terms of the individual probabilities that each cell is excited and able to send out spikes. Computing this solution is based on solving a set of non-linear algebraic equations whose parameters are related to the spiking rates of individual cells and their connectivity to other cells, as well as the arrival rates of spikes from outside the network. The RNN is a recurrent model, i.e. a neural network that is allowed to have complex feedback loops. A highly energy-efficient implementation of random neural networks was demonstrated by Krishna Palem et al. using the Probabilistic CMOS or PCMOS technology and was shown to be c. 226–300 times more efficient in terms of Energy-Performance-Product. RNNs are also related to artificial neural networks, which (like the random neural network) have gradient-based learning algorithms. The learning algorithm for an n-node random neural network that includes feedback loops (it is also a recurrent neural network) is of computational complexity O(n^3) (the number of computations is proportional to the cube of n, the number of neurons). The random neural network can also be used with other learning algorithms such as reinforcement learning. The RNN has been shown to be a universal approximator for bounded and continuous functions.

    Read more →