Second-order co-occurrence pointwise mutual information

Second-order co-occurrence pointwise mutual information

In computational linguistics, second-order co-occurrence pointwise mutual information (SOC-PMI) is a method used to measure semantic similarity, or how close in meaning two words are. The method does not require the two words to appear together in a text. Instead, it works by analyzing the "neighbor" words that typically appear alongside each of the two target words in a large body of text (corpus). If the two target words frequently share the same neighbors, they are considered semantically similar. For example, the words "cemetery" and "graveyard" may not appear in the same sentence often, but they both frequently appear near words like "buried," "dead," and "funeral." SOC-PMI uses this shared context to determine that they have a similar meaning. The method is called "second-order" because it doesn't look at the direct co-occurrence of the target words (which would be first-order), but at the co-occurrence of their neighbors (a second level of association). The strength of these associations is quantified using pointwise mutual information (PMI). == History == The method builds on earlier work like the PMI-IR algorithm, which used the AltaVista search engine to calculate word association probabilities. The key advantage of a second-order approach like SOC-PMI is its ability to measure similarity between words that do not co-occur often, or at all. The British National Corpus (BNC) has been used as a source for word frequencies and contexts for this method. == Methodology == The SOC-PMI algorithm measures the similarity between two words, w 1 {\displaystyle w_{1}} and w 2 {\displaystyle w_{2}} , in several steps. === Step 1: Score neighboring words with PMI === First, for each target word ( w 1 {\displaystyle w_{1}} and w 2 {\displaystyle w_{2}} ), the algorithm identifies its "neighbor" words within a certain text window (e.g., within 5 words to the left or right) across a large corpus. The strength of the association between a target word t i {\displaystyle t_{i}} and its neighbor w {\displaystyle w} is calculated using pointwise mutual information (PMI). A higher PMI value means the two words appear together more often than would be expected by chance. The PMI between a target word t i {\displaystyle t_{i}} and a neighbor word w {\displaystyle w} is calculated as: f pmi ( t i , w ) = log 2 ⁡ f b ( t i , w ) × m f t ( t i ) f t ( w ) {\displaystyle f^{\text{pmi}}(t_{i},w)=\log _{2}{\frac {f^{b}(t_{i},w)\times m}{f^{t}(t_{i})f^{t}(w)}}} where: f b ( t i , w ) {\displaystyle f^{b}(t_{i},w)} is the number of times t i {\displaystyle t_{i}} and w {\displaystyle w} appear together in the context window. f t ( t i ) {\displaystyle f^{t}(t_{i})} is the total number of times t i {\displaystyle t_{i}} appears in the corpus. f t ( w ) {\displaystyle f^{t}(w)} is the total number of times w {\displaystyle w} appears in the corpus. m {\displaystyle m} is the total number of tokens (words) in the corpus. === Step 2: Create a semantic 'signature' for each word === For each target word ( w 1 {\displaystyle w_{1}} and w 2 {\displaystyle w_{2}} ), the algorithm creates a list of its most significant neighbors. This is done by taking the top β {\displaystyle \beta } neighbor words, sorted in descending order by their PMI score with the target word. This list of top neighbors, X w {\displaystyle X^{w}} , acts as a semantic "signature" for the word w {\displaystyle w} . X w = { X i w } {\displaystyle X^{w}=\{X_{i}^{w}\}} , for i = 1 , 2 , … , β {\displaystyle i=1,2,\ldots ,\beta } The size of this list, β {\displaystyle \beta } , is a parameter of the method. === Step 3: Compare the signatures === The algorithm then compares the signatures of w 1 {\displaystyle w_{1}} and w 2 {\displaystyle w_{2}} . It looks for words that are present in both signatures. The similarity of w 1 {\displaystyle w_{1}} to w 2 {\displaystyle w_{2}} is calculated by summing the PMI scores of w 2 {\displaystyle w_{2}} with every word in w 1 {\displaystyle w_{1}} 's signature list. The β {\displaystyle \beta } -PMI summation function defines this score. The score for w 1 {\displaystyle w_{1}} with respect to w 2 {\displaystyle w_{2}} is: f ( w 1 , w 2 , β ) = ∑ i = 1 β ( f pmi ( X i w 1 , w 2 ) ) γ {\displaystyle f(w_{1},w_{2},\beta )=\sum _{i=1}^{\beta }(f^{\text{pmi}}(X_{i}^{w_{1}},w_{2}))^{\gamma }} This sum only includes terms where the PMI value is positive. The exponent γ {\displaystyle \gamma } (with a value > 1) is used to give more weight to neighbors that are more strongly associated with w 2 {\displaystyle w_{2}} . This calculation is done in both directions: The similarity of w 1 {\displaystyle w_{1}} with respect to w 2 {\displaystyle w_{2}} : f ( w 1 , w 2 , β 1 ) = ∑ i = 1 β 1 ( f pmi ( X i w 1 , w 2 ) ) γ {\displaystyle f(w_{1},w_{2},\beta _{1})=\sum _{i=1}^{\beta _{1}}(f^{\text{pmi}}(X_{i}^{w_{1}},w_{2}))^{\gamma }} The similarity of w 2 {\displaystyle w_{2}} with respect to w 1 {\displaystyle w_{1}} : f ( w 2 , w 1 , β 2 ) = ∑ i = 1 β 2 ( f pmi ( X i w 2 , w 1 ) ) γ {\displaystyle f(w_{2},w_{1},\beta _{2})=\sum _{i=1}^{\beta _{2}}(f^{\text{pmi}}(X_{i}^{w_{2}},w_{1}))^{\gamma }} === Step 4: Calculate final similarity score === Finally, the total semantic similarity is the average of the two scores from the previous step. S i m ( w 1 , w 2 ) = f ( w 1 , w 2 , β 1 ) β 1 + f ( w 2 , w 1 , β 2 ) β 2 {\displaystyle \mathrm {Sim} (w_{1},w_{2})={\frac {f(w_{1},w_{2},\beta _{1})}{\beta _{1}}}+{\frac {f(w_{2},w_{1},\beta _{2})}{\beta _{2}}}} This score can be normalized to fall between 0 and 1. For example, using this method, the words cemetery and graveyard achieve a high similarity score of 0.986 (with specific parameter settings).

JAX (software)

JAX is a Python library for accelerator-oriented array computation and program transformation, designed for high-performance numerical computing and large-scale machine learning. It is developed by Google with contributions from Nvidia and other community contributors. It is described as bringing together a modified version of the automatic differentiation system autograd and OpenXLA's XLA (Accelerated Linear Algebra). It is designed to follow the structure and workflow of NumPy as closely as possible and works with various existing frameworks such as TensorFlow and PyTorch. The primary features of JAX are: Providing a unified NumPy-like interface to computations that run on CPU, GPU, or TPU, in local or distributed settings. Built-in Just-In-Time (JIT) compilation via OpenXLA, an open-source machine learning compiler ecosystem. Efficient evaluation of gradients via its automatic differentiation transformations. Automatic vectorization to efficiently map functions over arrays representing batches of inputs. == Libraries using Jax == Flax Equinox Optax

Corel Designer

Corel DESIGNER is a vector-based graphics program. It was originally developed by Micrografx, which was bought by Corel in 2001. The last version developed by Micrografx was 9.0 in 2001. This program was later sold as Corel DESIGNER 9. There are still a number of users who continue working with version 9.0, because newer versions of the product are based on a modified CorelDRAW rather than the original product. Corel DESIGNER is effective for the creation of engineering drawings, but also offers many functions for graphic design. Starting with version X5, Corel DESIGNER Technical Suite includes Corel Designer, CorelDRAW and Corel Photo-Paint. X6 was the last release for Windows XP. == Release history and file formats ==

Kernel (image processing)

In image processing, a kernel, convolution matrix, or mask is a small matrix used for blurring, sharpening, embossing, edge detection, and more. This is accomplished by doing a convolution between the kernel and an image. Or more simply, when each pixel in the output image is a function of the nearby pixels (including itself) in the input image, the kernel is that function. == Details == The general expression of a convolution is g x , y = ω ∗ f x , y = ∑ i = − a a ∑ j = − b b ω i , j f x − i , y − j , {\displaystyle g_{x,y}=\omega f_{x,y}=\sum _{i=-a}^{a}{\sum _{j=-b}^{b}{\omega _{i,j}f_{x-i,y-j}}},} where g ( x , y ) {\displaystyle g(x,y)} is the filtered image, f ( x , y ) {\displaystyle f(x,y)} is the original image, ω {\displaystyle \omega } is the filter kernel. Every element of the filter kernel is considered by − a ≤ i ≤ a {\displaystyle -a\leq i\leq a} and − b ≤ j ≤ b {\displaystyle -b\leq j\leq b} . Depending on the element values, a kernel can cause a wide range of effects: The above are just a few examples of effects achievable by convolving kernels and images. === Origin === The origin is the position of the kernel which is above (conceptually) the current output pixel. This could be outside of the actual kernel, though usually it corresponds to one of the kernel elements. For a symmetric kernel, the origin is usually the center element. == Convolution == Convolution is the process of adding each element of the image to its local neighbors, weighted by the kernel. This is related to a form of mathematical convolution. The matrix operation being performed—convolution—is not traditional matrix multiplication, despite being similarly denoted by . For example, if we have two three-by-three matrices, the first a kernel, and the second an image piece, convolution is the process of flipping both the rows and columns of the kernel and multiplying locally similar entries and summing. The element at coordinates [2, 2] (that is, the central element) of the resulting image would be a weighted combination of all the entries of the image matrix, with weights given by the kernel: ( [ a b c d e f g h i ] ∗ [ 1 2 3 4 5 6 7 8 9 ] ) [ 2 , 2 ] = {\displaystyle \left({\begin{bmatrix}a&b&c\\d&e&f\\g&h&i\end{bmatrix}}{\begin{bmatrix}1&2&3\\4&5&6\\7&8&9\end{bmatrix}}\right)[2,2]=} ( i ⋅ 1 ) + ( h ⋅ 2 ) + ( g ⋅ 3 ) + ( f ⋅ 4 ) + ( e ⋅ 5 ) + ( d ⋅ 6 ) + ( c ⋅ 7 ) + ( b ⋅ 8 ) + ( a ⋅ 9 ) . {\displaystyle (i\cdot 1)+(h\cdot 2)+(g\cdot 3)+(f\cdot 4)+(e\cdot 5)+(d\cdot 6)+(c\cdot 7)+(b\cdot 8)+(a\cdot 9).} The other entries would be similarly weighted, where we position the center of the kernel on each of the boundary points of the image, and compute a weighted sum. The values of a given pixel in the output image are calculated by multiplying each kernel value by the corresponding input image pixel values. This can be described algorithmically with the following pseudo-code: for each image row in input image: for each pixel in image row: set accumulator to zero for each kernel row in kernel: for each element in kernel row: if element position corresponding to pixel position then multiply element value corresponding to pixel value add result to accumulator endif set output image pixel to accumulator corresponding input image pixels are found relative to the kernel's origin. If the kernel is symmetric then place the center (origin) of the kernel on the current pixel. The kernel will overlap the neighboring pixels around the origin. Each kernel element should be multiplied with the pixel value it overlaps with and all of the obtained values should be summed. This resultant sum will be the new value for the current pixel currently overlapped with the center of the kernel. If the kernel is not symmetric, it has to be flipped both around its horizontal and vertical axis before calculating the convolution as above. The general form for matrix convolution is [ x 11 x 12 ⋯ x 1 n x 21 x 22 ⋯ x 2 n ⋮ ⋮ ⋱ ⋮ x m 1 x m 2 ⋯ x m n ] ∗ [ y 11 y 12 ⋯ y 1 n y 21 y 22 ⋯ y 2 n ⋮ ⋮ ⋱ ⋮ y m 1 y m 2 ⋯ y m n ] = ∑ i = 0 m − 1 ∑ j = 0 n − 1 x ( m − i ) ( n − j ) y ( 1 + i ) ( 1 + j ) {\displaystyle {\begin{bmatrix}x_{11}&x_{12}&\cdots &x_{1n}\\x_{21}&x_{22}&\cdots &x_{2n}\\\vdots &\vdots &\ddots &\vdots \\x_{m1}&x_{m2}&\cdots &x_{mn}\\\end{bmatrix}}{\begin{bmatrix}y_{11}&y_{12}&\cdots &y_{1n}\\y_{21}&y_{22}&\cdots &y_{2n}\\\vdots &\vdots &\ddots &\vdots \\y_{m1}&y_{m2}&\cdots &y_{mn}\\\end{bmatrix}}=\sum _{i=0}^{m-1}\sum _{j=0}^{n-1}x_{(m-i)(n-j)}y_{(1+i)(1+j)}} === Edge handling === Kernel convolution usually requires values from pixels outside of the image boundaries. There are a variety of methods for handling image edges. Extend The nearest border pixels are conceptually extended as far as necessary to provide values for the convolution. Corner pixels are extended in 90° wedges. Other edge pixels are extended in lines. Wrap The image is conceptually wrapped (or tiled) and values are taken from the opposite edge or corner. Mirror The image is conceptually mirrored at the edges. For example, attempting to read a pixel 3 units outside an edge reads one 3 units inside the edge instead. Crop / Avoid overlap Any pixel in the output image which would require values from beyond the edge is skipped. This method can result in the output image being slightly smaller, with the edges having been cropped. Move kernel so that values from outside of image is never required. Machine learning mainly uses this approach. Example: Kernel size 10x10, image size 32x32, result image is 23x23. Kernel Crop Any pixel in the kernel that extends past the input image isn't used and the normalizing is adjusted to compensate. Constant Use constant value for pixels outside of image. Usually black or sometimes gray is used. Generally this depends on application. === Normalization === Normalization is defined as the division of each element in the kernel by the sum of all kernel elements, so that the sum of the elements of a normalized kernel is unity. This will ensure the average pixel in the modified image is as bright as the average pixel in the original image. === Optimization === Fast convolution algorithms include: separable convolution ==== Separable convolution ==== 2D convolution with an M × N kernel requires M × N multiplications for each sample (pixel). If the kernel is separable, then the computation can be reduced to M + N multiplications. Using separable convolutions can significantly decrease the computation by doing 1D convolution twice instead of one 2D convolution. === Implementation === Here a concrete convolution implementation done with the GLSL shading language :

OpenPipeline

openPipeline is an open-source plug-in for Autodesk Maya that is designed to assist in a Production Pipeline structure and Computer animation. == Development == Created in Maya Embedded Language, openPipeline was initiated at Eyebeam Atelier and further developed at Pratt Institute in the Digital Arts Lab. The initial release date was December 28, 2006. == Contributors == Rob O'Neill (Creator) Paris Mavroidis Meng-Han Ho

Trigger list

Trigger list in its most general meaning refers to a list whose items are used to initiate ("trigger") certain actions. == United States: Private financial information == In the United States, when a person applies for a mortgage loan, the lender makes a credit inquiry about the potential borrower from the national credit bureaus, Equifax, Experian and TransUnion. Unless the borrower is opted out, the credit bureaus put the applicants onto a "trigger list" of "leads" about persons who are interested in new loans. These lists are sold to numerous lenders all over the United States, and soon after the application the applicant starts receiving offers from all parts of the country. The trigger lists contain a significant amount of personal financial information. Among the buyers of trigger lists are "lead generators" which resell filtered information to borrowers, e.g., of people who live in a certain area and have a certain credit score. While the Federal Trade Commission considers the market of "trigger lists" to be a legal business, many people and organizations (such as the National Association of Mortgage Brokers) consider this a serious breach of privacy and lobby for putting this practice under regulatory controls. As of now, American consumers may opt-out from "trigger lists" by calling 1-888-5-OPTOUT (1-888-567-8688). == Nuclear non-proliferation == The Zangger Committee and the Nuclear Suppliers Group maintain lists of items that may contribute to nuclear proliferation; The nuclear non-proliferation treaty forbids its members to export such items to non-treaty members. these items are said to trigger the countries' responsibilities under the NPT, hence the name.

Dailyhunt

Dailyhunt (formerly Newshunt) is an Indian content and news aggregator application based in Bangalore, India that provides local language content in 14 Indian languages from multiple content providers. Viru serves as Founder of Dailyhunt with Co-founder Umang Bedi. == History == Dailyhunt, earlier called Newshunt, was created as a Symbian app in 2009 by two ex-Nokia employees Umesh Kulkarni and Chandrashekhar Sohoni. Later in 2011, Newshunt became available on the Android platform. It was by that time that Virendra Gupta, founder of Verse acquired the application. Virendra Gupta, better known as Viru, had started Verse in 2007 as a value-added service (VAS) company. In 2011, he acquired Newshunt from its owners Umesh and Chandrashekhar. Umesh became the CTO and stayed on to oversee its transition towards the smartphone era. In 2015, Viru renamed Newshunt as Dailyhunt. In early 2018, Viru roped in Umang Bedi, to be the President of Dailyhunt and lead the business with him while focusing on making the benefits of the platform available to a larger audience. Umang was elevated to co-founder in 2020. == Funding == In September 2014, Dailyhunt (then known as Newshunt) closed its Series B funding of INR 1 billion ( or approx $12 million in 2014) from Sequoia Capital India. The Series C funding round was led by Falcon Capital and was closed with $40 million in February 2015. In October 2016, the company received its Series D funding of $25 million from ByteDance and a Series E funding of $6.39 million from Falcon Edge Capital in September 2018. Additionally, Dailyhunt raised $3 Mn (INR 21.75 Cr) in a Series F funding round from Stonebridge Capital in August 2019. Other investors of Dailyhunt include Matrix Partners India, Omidyar Network, Goldman Sachs and Sofina. == Tie-ups and partnerships == In January 2021, Dailyhunt partnered with Twitter to bring ‘Twitter Moments’ to the Indian social app. Dailyhunt app now has a dedicated tab called “Twitter Moments India” to showcase curated tweets pertaining to news and other events. In January 2021, Dailyhunt announced the premiere of Season 2 of the popular show QuoteUnquote with KK (Kapil Khandelwal) on the app. It was the first podcast to have been launched on the Dailyhunt app. In September 2020, Dailyhunt signed up as an Associate Sponsor with Star Sports for Dream 11 IPL 2020. In May 2020, Snapdeal partnered with Dailyhunt to add new content on marketplace. In March 2019, Discovery Communications India, the factual entertainment network, entered into a multi-year partnership with Dailyhunt to showcase short-form content.