Glushkov's construction algorithm

In computer science theory – particularly formal language theory – Glushkov's construction algorithm, invented by Victor Mikhailovich Glushkov, transforms a given regular expression into an equivalent nondeterministic finite automaton (NFA). Thus, it forms a bridge between regular expressions and nondeterministic finite automata: two abstract representations of the same class of formal languages. A regular expression may be used to conveniently describe an advanced search pattern in a "find and replace"–like operation of a text processing utility. Glushkov's algorithm can be used to transform it into an NFA, which furthermore is small by nature, as the number of its states equals the number of symbols of the regular expression, plus one. Subsequently, the NFA can be made deterministic by the powerset construction and then be minimized to get an optimal automaton corresponding to the given regular expression. The latter format is best suited for execution on a computer. From another, more theoretical point of view, Glushkov's algorithm is a part of the proof that NFA and regular expressions both accept exactly the same languages; that is, the regular languages. The converse of Glushkov's algorithm is Kleene's algorithm, which transforms a finite automaton into a regular expression. The automaton obtained by Glushkov's construction is the same as the one obtained by Thompson's construction algorithm, once its ε-transitions are removed. Glushkov's construction algorithm is also called The algorithm of Berry-Sethi, named after Gérard Berry and Ravi Sethi who worked on this construction. == Construction == Given a regular expression e, the Glushkov Construction Algorithm creates a non-deterministic automaton that accepts the language L ( e ) {\displaystyle L(e)} accepted by e. The construction uses four steps: === Step 1 === Linearisation of the expression. Each letter of the alphabet appearing in the expression e is renamed, so that each letter occurs at most once in the new expression e ′ {\displaystyle e'} . Glushkov's construction essentially relies on the fact that e ′ {\displaystyle e'} represents a local language L ( e ′ ) {\displaystyle L(e')} . Let A be the old alphabet and let B be the new one. === Step 2a === Computation of the sets P ( e ′ ) {\displaystyle P(e')} , D ( e ′ ) {\displaystyle D(e')} , and F ( e ′ ) {\displaystyle F(e')} . The first, P ( e ′ ) {\displaystyle P(e')} , is the set of letters which occurs as first letter of a word of L ( e ′ ) {\displaystyle L(e')} . The second, D ( e ′ ) {\displaystyle D(e')} , is the set of letters that can end a word of L ( e ′ ) {\displaystyle L(e')} . The last one, F ( e ′ ) {\displaystyle F(e')} , is the set of letter pairs that can occur in words of L ( e ′ ) {\displaystyle L(e')} , i.e. it is the set of factors of length two of the words of L ( e ′ ) {\displaystyle L(e')} . Those sets are mathematically defined by P ( e ′ ) = { x ∈ B ∣ x B ∗ ∩ L ( e ′ ) ≠ ∅ } {\displaystyle P(e')=\{x\in B\mid xB^{}\cap L(e')\neq \emptyset \}} , D ( e ′ ) = { y ∈ B ∣ B ∗ y ∩ L ( e ′ ) ≠ ∅ } {\displaystyle D(e')=\{y\in B\mid B^{}y\cap L(e')\neq \emptyset \}} , F ( e ′ ) = { u ∈ B 2 ∣ B ∗ u B ∗ ∩ L ( e ′ ) ≠ ∅ } {\displaystyle F(e')=\{u\in B^{2}\mid B^{}uB^{}\cap L(e')\neq \emptyset \}} . They are computed by induction over the structure of the expression, as explained below. === Step 2b === Computation of the set Λ ( e ′ ) {\displaystyle \Lambda (e')} which contains the empty word ε {\displaystyle \varepsilon } if this word belongs to L ( e ′ ) {\displaystyle L(e')} , and is the empty set otherwise. Formally, this is Λ ( e ′ ) = { ε } ∩ L ( e ′ ) {\displaystyle \Lambda (e')=\{\varepsilon \}\cap L(e')} . === Step 3 === Computation of automaton recognizing the local language, as defined by P ( e ′ ) {\displaystyle P(e')} , D ( e ′ ) {\displaystyle D(e')} , F ( e ′ ) {\displaystyle F(e')} , and Λ ( e ′ ) {\displaystyle \Lambda (e')} . By definition, the local language defined by the sets P, D, and F is the set of words which begin with a letter of P, end by a letter of D, and whose factors of length 2 belong to F, optionally also including the empty word; that is, it is the language: L ′ = ( P B ∗ ∩ B ∗ D ) ∖ B ∗ ( B 2 ∖ F ) B ∗ ∪ Λ ( e ′ ) {\displaystyle L'=(PB^{}\cap B^{}D)\setminus B^{}(B^{2}\setminus F)B^{}\cup \Lambda (e')} . Strictly speaking, it is the computation of the automaton for the local language denoted by this linearised expression that is Glushkov's construction. === Step 4 === Remove the linearisation, replacing each indexed letter B by the original letter of A. == Example == Consider the regular expression e = ( a ( a b ) ∗ ) ∗ + ( b a ) ∗ {\displaystyle e=(a(ab)^{})^{}+(ba)^{}} . == Computation of the set of letters == The computation of the sets P, D, F, and Λ is done inductively over the regular expression e ′ {\displaystyle e'} . One must give the values for ∅, ε (the symbols for the empty language and the singleton language containing the empty word), the letters, and the results of the operations + , ⋅ , ∗ {\displaystyle +,\cdot ,^{}} . The most costly operations are the cartesian products of sets for the computation of F. == Properties == The obtained automaton is non-deterministic, and it has as many states as the number of letters of the regular expression, plus one. It has been proven that every Thompson's automaton can be transformed into Glushkov's automaton via a ε-transitions elimination method. == Applications and deterministic expressions == The computation of the automaton by the expression occurs often; it has been systematically used in search functions, in particular by the Unix grep command. Similarly, XML's specification also uses such constructions; for more efficiency, regular expressions of a certain kind, called deterministic expressions, have been studied.

Landweber iteration

The Landweber iteration or Landweber algorithm is an algorithm to solve ill-posed linear inverse problems, and it has been extended to solve non-linear problems that involve constraints. The method was first proposed in the 1950s by Louis Landweber, and it can be now viewed as a special case of many other more general methods. == Basic algorithm == The original Landweber algorithm attempts to recover a signal x from (noisy) measurements y. The linear version assumes that y = A x {\displaystyle y=Ax} for a linear operator A. When the problem is in finite dimensions, A is just a matrix. When A is nonsingular, then an explicit solution is x = A − 1 y {\displaystyle x=A^{-1}y} . However, if A is ill-conditioned, the explicit solution is a poor choice since it is sensitive to any noise in the data y. If A is singular, this explicit solution doesn't even exist. The Landweber algorithm is an attempt to regularize the problem, and is one of the alternatives to Tikhonov regularization. We may view the Landweber algorithm as solving: min x ‖ A x − y ‖ 2 2 / 2 {\displaystyle \min _{x}\|Ax-y\|_{2}^{2}/2} using an iterative method. The algorithm is given by the update x k + 1 = x k − ω A ∗ ( A x k − y ) . {\displaystyle x_{k+1}=x_{k}-\omega A^{}(Ax_{k}-y).} where the relaxation factor ω {\displaystyle \omega } satisfies 0 < ω < 2 / σ 1 2 {\displaystyle 0<\omega <2/\sigma _{1}^{2}} . Here σ 1 {\displaystyle \sigma _{1}} is the largest singular value of A {\displaystyle A} . If we write f ( x ) = ‖ A x − y ‖ 2 2 / 2 {\displaystyle f(x)=\|Ax-y\|_{2}^{2}/2} , then the update can be written in terms of the gradient x k + 1 = x k − ω ∇ f ( x k ) {\displaystyle x_{k+1}=x_{k}-\omega \nabla f(x_{k})} and hence the algorithm is a special case of gradient descent. For ill-posed problems, the iterative method needs to be stopped at a suitable iteration index, because it semi-converges. This means that the iterates approach a regularized solution during the first iterations, but become unstable in further iterations. The reciprocal of the iteration index 1 / k {\displaystyle 1/k} acts as a regularization parameter. A suitable parameter is found, when the mismatch ‖ A x k − y ‖ 2 2 {\displaystyle \|Ax_{k}-y\|_{2}^{2}} approaches the noise level. Using the Landweber iteration as a regularization algorithm has been discussed in the literature. == Nonlinear extension == In general, the updates generated by x k + 1 = x k − τ ∇ f ( x k ) {\displaystyle x_{k+1}=x_{k}-\tau \nabla f(x_{k})} will generate a sequence f ( x k ) {\displaystyle f(x_{k})} that converges to a minimizer of f whenever f is convex and the stepsize τ {\displaystyle \tau } is chosen such that 0 < τ < 2 / ( ‖ ∇ f ‖ 2 ) {\displaystyle 0<\tau <2/(\|\nabla f\|^{2})} where ‖ ⋅ ‖ {\displaystyle \|\cdot \|} is the spectral norm. Since this is special type of gradient descent, there currently is not much benefit to analyzing it on its own as the nonlinear Landweber, but such analysis was performed historically by many communities not aware of unifying frameworks. The nonlinear Landweber problem has been studied in many papers in many communities; see, for example. == Extension to constrained problems == If f is a convex function and C is a convex set, then the problem min x ∈ C f ( x ) {\displaystyle \min _{x\in C}f(x)} can be solved by the constrained, nonlinear Landweber iteration, given by: x k + 1 = P C ( x k − τ ∇ f ( x k ) ) {\displaystyle x_{k+1}={\mathcal {P}}_{C}(x_{k}-\tau \nabla f(x_{k}))} where P {\displaystyle {\mathcal {P}}} is the projection onto the set C. Convergence is guaranteed when 0 < τ < 2 / ( ‖ A ‖ 2 ) {\displaystyle 0<\tau <2/(\|A\|^{2})} . This is again a special case of projected gradient descent (which is a special case of the forward–backward algorithm) as discussed in. == Applications == Since the method has been around since the 1950s, it has been adopted and rediscovered by many scientific communities, especially those studying ill-posed problems. In X-ray computed tomography it is called simultaneous iterative reconstruction technique (SIRT). It has also been used in the computer vision community and the signal restoration community. It is also used in image processing, since many image problems, such as deconvolution, are ill-posed. Variants of this method have been used also in sparse approximation problems and compressed sensing settings.

Visopsys

Visopsys (Visual Operating System), is an operating system, written by Andy McLaughlin. Development of the operating system began in 1997. The operating system is licensed under the GNU GPL, with the headers and libraries under the less restrictive LGPL license. It runs on the 32-bit IA-32 architecture. It features a multitasking kernel, supports asynchronous I/O and the FAT line of file systems. It requires a Pentium processor. == History == The development of Visopsys began in 1997, being written by Andy McLaughlin. The first public release of the Operating System was on 2 March 2001, with version 0.1. In this release, Visopsys was a 32 bit operating system, supporting preemptive multitasking and virtual memory. == System overview == Visopsys uses a monolithic kernel, written in the C programming language, with elements of assembly language for certain interactions with the hardware. The operating system supports a graphical user interface, with a small C library.

CodePen

CodePen is an online community for testing and showcasing user-created HTML, CSS and JavaScript code snippets. It functions as an online code editor and open-source learning environment, where developers can create code snippets, called "pens," and test them. It was founded in 2012 by full-stack developers Alex Vazquez and Tim Sabat and front-end designer Chris Coyier. Its employees work remotely, rarely all meeting together in person. CodePen is a large community for web designers and developers to showcase their coding skills, with an estimated 330,000 registered users and 14.16 million monthly visitors.

Extremely online

An extremely online (often capitalized), terminally online, or chronically online person is someone who is closely engaged with Internet culture. People said to be extremely online often believe that online posts are very important. Events and phenomena can themselves be extremely online; while often used as a descriptive term, the phenomenon of extreme online usage has been described as "both a reformation of the delivery of ideas – shared through words and videos and memes and GIFs and copypasta – and the ideas themselves". Here, "online" is used to describe "a way of doing things, not [simply] the place they are done". == Criteria == While the term was in use as early as 2014, it gained popularity over the latter half of the 2010s in conjunction with the increasing prevalence and notability of Internet phenomena in all areas of life. Extremely online people, according to The Daily Dot, are interested in topics "no normal, healthy person could possibly care about", and have been analogized to "pop culture fandoms, just without the pop". Extremely online phenomena such as fan culture and reaction GIFs have been described as "swallowing democracy" by journalists such as Amanda Hess in The New York Times, who claimed that a "great convergence between politics and culture, values and aesthetics, citizenship and commercialism" had become "a dominant mode of experiencing politics". Vulture – formerly the pop culture section of New York magazine, now a stand-alone website – has a section for articles tagged "extremely online". == Historical background == In the 2010s, many categories and labels came into wide use from media outlets to describe Internet-mediated cultural trends, such as the alt-right, the dirtbag left, and doomerism. These ideological categories are often defined by their close association with online discourse. For example, the term "alt-right" was added to the Associated Press' stylebook in 2016 to describe the "digital presence" of far-right ideologies, the dirtbag left refers to a group of "underemployed and overly online millennials" who "have no time for the pieties of traditional political discourse", and the doomer's "blackpilled despair" is combined with spending "too much time on message boards in high school" to produce an eclectic "anti-socialism". Extreme onlineness transcends ideological boundaries. For example, right-wing figures like Alex Jones and Laura Loomer have been described as "extremely online", but so have those on the left like Alexandria Ocasio-Cortez and fans of the Chapo Trap House podcast. Extremely online phenomena can range from acts of offline violence (such as the 2019 Christchurch shootings) to "[going] on NPR to explain the anti-capitalist irony inherent in kids eating Tide Pods". United States President Donald Trump's posts on social media have been frequently cited as extremely online, during both his presidency and his 2020 presidential campaign; Vox claimed his approach to re-election veered into being "Too Online", and Reason questioned whether the final presidential debate was "incomprehensible to normies". While individual people are often given the description, being extremely online has also been posited as an overall cultural phenomenon, applying to trends like lifestyle movements suffixed with "-wave" and "-core" based heavily on Internet media, as well as an increasing expectation for digital social researchers to have an "online presence" to advance in their careers. == Participants and media coverage == One example of a phenomenon considered to be extremely online is the "wife guy" (a guy who posts about his wife); despite being a "stupid online thing" which spent several years as a piece of Internet slang, in 2019 it became the subject of five articles in leading U.S. media outlets. Like many extremely online phrases and phenomena, the "wife guy" has been attributed in part to the in-character Twitter account dril. The account frequently parodies how people behave on the Internet, and has been widely cited as influential on online culture. In one tweet, his character refuses to stop using the Internet, even when someone shouts outside his house that he should log off. Many of dril's other coinages have become ubiquitous parts of Internet slang. Throughout the 2010s, posters such as dril inspired commonly used terms like "corncobbing" (referring to someone losing an argument and failing to admit it); while originally a piece of obscure Internet slang used on sites like Twitter, use of the term (and controversy over its misinterpretation) became a subject of reporting from traditional publications, with some noting that keeping up with the rapid turnover of inside jokes, memes, and quotes online required daily attention to avoid embarrassment. Twitch has been described as "talk radio for the extremely online". Another example of an event cited as extremely online is No Nut November. Increasingly, researchers are expected to have more of an online presence, to advance in their careers, as networking and portfolios continue to transition to the digital world. In November 2020, an article in The Washington Post criticized the filter bubble theory of online discourse on the basis that it "overgeneralized" based on a "small subset of extremely online people". The 2021 storming of the United States Capitol was described as extremely online, with "pro-Trump internet personalities", such as Baked Alaska, and fans livestreaming and taking selfies. People who have been described as extremely online include Chrissy Teigen, Jon Ossoff, and Andrew Yang. In contrast, Joe Biden has been cited as the antithesis of extremely online—The New York Times wrote in 2019 that he had "zero meme energy".

Legal information retrieval

Legal information retrieval is the science of information retrieval applied to legal text, including legislation, case law, and scholarly works. Accurate legal information retrieval is important to provide access to the law to laymen and legal professionals. Its importance has increased because of the vast and quickly increasing amount of legal documents available through electronic means. Legal information retrieval is a part of the growing field of legal informatics. In a legal setting, it is frequently important to retrieve all information related to a specific query. However, commonly used boolean search methods (exact matches of specified terms) on full text legal documents have been shown to have an average recall rate as low as 20 percent, meaning that only 1 in 5 relevant documents are actually retrieved. In that case, researchers believed that they had retrieved over 75% of relevant documents. This may result in failing to retrieve important or precedential cases. In some jurisdictions this may be especially problematic, as legal professionals are ethically obligated to be reasonably informed as to relevant legal documents. Legal Information Retrieval attempts to increase the effectiveness of legal searches by increasing the number of relevant documents (providing a high recall rate) and reducing the number of irrelevant documents (a high precision rate). This is a difficult task, as the legal field is prone to jargon, polysemes (words that have different meanings when used in a legal context), and constant change. Techniques used to achieve these goals generally fall into three categories: boolean retrieval, manual classification of legal text, and natural language processing of legal text. == Problems == Application of standard information retrieval techniques to legal text can be more difficult than application in other subjects. One key problem is that the law rarely has an inherent taxonomy. Instead, the law is generally filled with open-ended terms, which may change over time. This can be especially true in common law countries, where each decided case can subtly change the meaning of a certain word or phrase. Legal information systems must also be programmed to deal with law-specific words and phrases. Though this is less problematic in the context of words which exist solely in law, legal texts also frequently use polysemes, words may have different meanings when used in a legal or common-speech manner, potentially both within the same document. The legal meanings may be dependent on the area of law in which it is applied. For example, in the context of European Union legislation, the term "worker" has four different meanings: Any worker as defined in Article 3(a) of Directive 89/391/EEC who habitually uses display screen equipment as a significant part of his normal work. Any person employed by an employer, including trainees and apprentices but excluding domestic servants; Any person carrying out an occupation on board a vessel, including trainees and apprentices, but excluding port pilots and shore personnel carrying out work on board a vessel at the quayside; Any person who, in the Member State concerned, is protected as an employee under national employment law and in accordance with national practice; It also has the common meaning: A person who works at a specific occupation. Though the terms may be similar, correct information retrieval must differentiate between the intended use and irrelevant uses in order to return the correct results. Even if a system overcomes the language problems inherent in law, it must still determine the relevancy of each result. In the context of judicial decisions, this requires determining the precedential value of the case. Case decisions from senior or superior courts may be more relevant than those from lower courts, even where the lower court's decision contains more discussion of the relevant facts. The opposite may be true, however, if the senior court has only a minor discussion of the topic (for example, if it is a secondary consideration in the case). An information retrieval system must also be aware of the authority of the jurisdiction. A case from a binding authority is most likely of more value than one from a non-binding authority. Additionally, the intentions of the user may determine which cases they find valuable. For instance, where a legal professional is attempting to argue a specific interpretation of law, he might find a minor court's decision which supports his position more valuable than a senior courts position which does not. He may also value similar positions from different areas of law, different jurisdictions, or dissenting opinions. Overcoming these problems can be made more difficult because of the large number of cases available. The number of legal cases available via electronic means is constantly increasing (in 2003, US appellate courts handed down approximately 500 new cases per day), meaning that an accurate legal information retrieval system must incorporate methods of both sorting past data and managing new data. == Techniques == === Boolean searches === Boolean searches, where a user may specify terms such as use of specific words or judgments by a specific court, are the most common type of search available via legal information retrieval systems. They are widely implemented but overcome few of the problems discussed above. The recall and precision rates of these searches vary depending on the implementation and searches analyzed. One study found a basic boolean search's recall rate to be roughly 20%, and its precision rate to be roughly 79%. Another study implemented a generic search (that is, not designed for legal uses) and found a recall rate of 56% and a precision rate of 72% among legal professionals. Both numbers increased when searches were run by non-legal professionals, to a 68% recall rate and 77% precision rate. This is likely explained because of the use of complex legal terms by the legal professionals. === Manual classification === In order to overcome the limits of basic boolean searches, information systems have attempted to classify case laws and statutes into more computer friendly structures. Usually, this results in the creation of an ontology to classify the texts, based on the way a legal professional might think about them. These attempt to link texts on the basis of their type, their value, and/or their topic areas. Most major legal search providers now implement some sort of classification search, such as Westlaw's “Natural Language” or LexisNexis' Headnote searches. Additionally, both of these services allow browsing of their classifications, via Westlaw's West Key Numbers or Lexis' Headnotes. Though these two search algorithms are proprietary and secret, it is known that they employ manual classification of text (though this may be computer-assisted). These systems can help overcome the majority of problems inherent in legal information retrieval systems, in that manual classification has the greatest chances of identifying landmark cases and understanding the issues that arise in the text. In one study, ontological searching resulted in a precision rate of 82% and a recall rate of 97% among legal professionals. The legal texts included, however, were carefully controlled to just a few areas of law in a specific jurisdiction. The major drawback to this approach is the requirement of using highly skilled legal professionals and large amounts of time to classify texts. As the amount of text available continues to increase, some have stated their belief that manual classification is unsustainable. === Natural language processing === In order to reduce the reliance on legal professionals and the amount of time needed, efforts have been made to create a system to automatically classify legal text and queries. Adequate translation of both would allow accurate information retrieval without the high cost of human classification. These automatic systems generally employ Natural Language Processing (NLP) techniques that are adapted to the legal domain, and also require the creation of a legal ontology. Though multiple systems have been postulated, few have reported results. One system, “SMILE,” which attempted to automatically extract classifications from case texts, resulted in an f-measure (which is a calculation of both recall rate and precision) of under 0.3 (compared to perfect f-measure of 1.0). This is probably much lower than an acceptable rate for general usage. Despite the limited results, many theorists predict that the evolution of such systems will eventually replace manual classification systems. === Citation-Based ranking === In the mid-90s the Room 5 case law retrieval project used citation mining for summaries and ranked its search results based on citation type and count. This slightly pre-dated the PageRank algorithm at Stanford which was also a citation-based ranking. Ranking of results was based

FactorDaily

FactorDaily is an Indian digital media publication founded in 2016 by Pankaj Mishra and Jayadevan PK. Mishra was formerly an Editor at TechCrunch and the Economic Times. The digital publication was launched with an intent to produce stories on the impact of technology on life in India. == History == FactorDaily began publishing in May 2016, with daily reported stories on technology, culture and life in India. Prior to its launch, the company had raised $1 million in seed funding from Accel India, Blume Ventures, Girish Mathrubootham of Freshdesk, Vijay Shekhar Sharma of PayTm, and Jay Vijayan of Tekion. Josey Puliyenthuruthel John, formerly Managing Editor at Business Today and National Corporate Editor at Mint, later joined the company as a Consulting Editor. In January 2017, FactorDaily launched its first Podcast called The Outliers. The inaugural episode featured a conversation with Manish Sharma of Printo on his journey starting up. == Awards == The FactorDaily team won the Bengaluru Editors Lab 2017, a journalism hackathon organised by the Global Editors Network (GEN). The story titled "India has 3,800 psychiatrists for 1.2bn people. Can tech step in to manage mental health?" won the first prize in the online category of the fifth Schizophrenia Research Foundation’s (SCARF) ‘Media for Mental Health’ awards. The story titled 'The dark hand of tech that stokes sex trafficking in India', won the Stop Slavery media Awards by the Thomson Reuters Foundation for the year 2020.