AI Content Detector Zero

AI Content Detector Zero — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • INaturalist

    INaturalist

    iNaturalist is an American 501(c)(3) nonprofit social network of naturalists, citizen scientists, and biologists built on the concept of mapping and sharing observations of biodiversity across the globe. iNaturalist may be accessed via its website or from its mobile applications. iNaturalist includes an automated species identification tool, and users further assist each other in identifying organisms from photographs and sound recordings. As of 5 August 2025, iNaturalist users had contributed nearly 300 million observations of plants, animals, fungi, and other organisms worldwide, and 400,000 users were active in the previous 30 days. iNaturalist serves as an important resource of open data for biodiversity research, conservation, and education, describing itself as "an online social network of people sharing biodiversity information to help each other learn about nature." It is the primary application for crowd-sourced biodiversity data in places such as Mexico, southern Africa, and Australia, and the project has been called "a standard-bearer for natural history mobile applications." Most of iNaturalist's software is open source. It has contributed to over 4,000 research papers and is widely used by scientists, land managers, and conservationists worldwide. The platform has also been active in the discovery of new species and rediscovery of species previously assumed to be extinct. == History == iNaturalist began in 2008 as a UC Berkeley School of Information Master's final project of Nate Agrin, Jessica Kline, and Ken-ichi Ueda. Agrin and Ueda continued work on the site with Sean McGregor, a web developer. In 2011, Ueda began collaboration with Scott Loarie, a research fellow at Stanford University and lecturer at UC Berkeley. Ueda and Loarie are the current co-directors of iNaturalist.org. The organization merged with the California Academy of Sciences on 24 April 2014. In 2017, iNaturalist became a joint initiative between the California Academy of Sciences and the National Geographic Society. With these collaborations and growing popularity of the site since 2012, the number of participants and observations has roughly doubled each year. In 2014, iNaturalist reached 1 million observations. Later, as of October 2023, there were 181 million observations (163 million verifiable). On 11 July 2023 iNaturalist announced its status as a newly independent 501(c)(3) nonprofit organization. === Google AI controversy === On 9 June 2025 Google announced that iNaturalist would be part of its "Generative AI Accelerator". This announcement, paired with the initial lack of information on the iNaturalist site, led to outcry from many iNaturalist users in the blog comments and forum, worrying about the consequences for the environment, volunteer engagement, reliability and raised questions about the decision making within iNaturalist, while some saw the backlash as a sign that people want to resist 'corrosive technologies'. PZ Myers, a biology professor who uses iNaturalist in his teaching, published an article on his website Pharyngula stating that "any decision that drives people away and replaces them with a hallucinating bot is a bad decision". == Platforms == Users can interact with iNaturalist in the following ways: through the iNaturalist.org website, through two mobile apps: iNaturalist (iOS/Android) and Seek by iNaturalist (iOS/Android), or through partner organizations such as the Global Biodiversity Information Facility (GBIF) website. On the iNaturalist.org website, visitors can search the public dataset and interact with other people adding observations and identifications. The website provides tools for registered users to add, identify, and discuss observations, write journal posts, explore information about species, create project pages to recruit participation, and coordinate work on their topics of interest. On the iNaturalist mobile app, users can create and share nature observations to the online dataset, explore observations both nearby and around the world, and learn about different species. Seek by iNaturalist, a separate app marketed to families, requires no online account registration and all observations may remain private. Seek incorporates features of gamification, such as providing a list of nearby organisms to find and encouraging the collection of badges and participation in challenges. Seek was initially released in the spring of 2018. == Observations == The iNaturalist platform is based on crowdsourcing of observations and identifications. An iNaturalist observation records a person's encounter with an individual organism at a particular time and place. An iNaturalist observation may also record evidence of an organism, such as animal tracks, nests, or scat. The scope of iNaturalist excludes natural but inert subjects such as geologic or hydrologic features. Users typically upload photos as evidence of their findings, though audio recordings are also accepted, and such evidence is not a strict requirement. Users may share observation locations publicly, "obscure" them to display a less precise location or make the locations completely private. iNaturalist users can add identifications to each other's observations in order to confirm or improve the identification of the observation. Observations are classified as "Casual", "Needs ID" (needs identification), or "Research Grade" based on the quality of the data provided and the community identification process. Any quality of data can be downloaded from iNaturalist and "Research Grade" observations are often incorporated into other online databases such as the Global Biodiversity Information Facility and the Atlas of Living Australia. === Automated species identification === In addition to observations being identified by others in the community, iNaturalist includes an automated species identification tool, first released in 2017. Images can be identified via a computer vision model which has been trained on the large database of the observations on iNaturalist. Multiple species suggestions are typically provided with the suggestion that the software guesses to be most likely is at the top of the list. A broader taxon such as a genus or family is commonly provided if the model is unsure of the species. It is trained once or twice a year, and the threshold for species included in the training set has changed over time. It can be difficult for the model to guess correctly if the species in question is infrequently observed or hard to identify from images alone, or if the image submitted has poor lighting, is blurry, or contains multiple subjects. In February 2023, iNaturalist released v2.1 of its computer vision model, which was trained on a new source model which performed significantly better than the previous models trained using a different source model. In April 2025 iNaturalist released an updated app for iOS, changing the original version to "iNaturalist Classic." == Projects == Users have created and contributed to tens of thousands of different projects on iNaturalist. The platform is commonly used to record observations during bioblitzes, which are biological surveying events that attempt to record all the species that occur within a designated area, and a specific project type on iNaturalist. Other project types include collections of observations by location or taxon or documenting specific types of observations such as animal tracks and signs, the spread of invasive species, roadkill, fishing catches, or discovering new species. In 2011, iNaturalist was used as a platform to power the Global Amphibian and Global Reptile BioBlitzes, in which observations were used to help monitor the occurrence and distribution of the world's reptiles and amphibian species. The US National Park Service partnered with iNaturalist to record observations from the 2016 National Parks BioBlitz. That project exceeded 100,000 observations in August 2016. In 2017, the United Nations Environment Programme teamed up with iNaturalist to celebrate World Environment Day.. In 2022, Reef Ecologic teamed up with iNaturalist to celebrate World Oceans Day. === City Nature Challenge === In 2016, Lila Higgins from the Natural History Museum of Los Angeles County and Alison Young from the California Academy of Sciences co-founded the City Nature Challenge (CNC). In the first City Nature Challenge, naturalists in Los Angeles and the San Francisco Bay Area documented over 20,000 observations with the iNaturalist platform. In 2017, the CNC expanded to 16 cities across the United States and collected over 125,000 observations of wildlife in 5 days. The CNC expanded to a global audience in 2018, with 68 cities participating from 19 countries, with some cities using community science platforms other than iNaturalist to participate. In 4 days, over 17,000 people cataloged over 440,000 nature observations in urban regions around the world. In 2019, the CNC once again expanded, with 35,000 parti

    Read more →
  • Cross-language information retrieval

    Cross-language information retrieval

    Cross-language information retrieval (CLIR) is a subfield of information retrieval dealing with retrieving information written in a language different from the language of the user's query. The term "cross-language information retrieval" has many synonyms, of which the following are perhaps the most frequent: cross-lingual information retrieval, translingual information retrieval, multilingual information retrieval. The term "multilingual information retrieval" refers more generally both to technology for retrieval of multilingual collections and to technology which has been moved to handle material in one language to another. The term Multilingual Information Retrieval (MLIR) involves the study of systems that accept queries for information in various languages and return objects (text, and other media) of various languages, translated into the user's language. Cross-language information retrieval refers more specifically to the use case where users formulate their information need in one language and the system retrieves relevant documents in another. To do so, most CLIR systems use various translation techniques. CLIR techniques can be classified into different categories based on different translation resources: Dictionary-based CLIR techniques Parallel corpora based CLIR techniques Comparable corpora based CLIR techniques Machine translator based CLIR techniques CLIR systems have improved so much that the most accurate multi-lingual and cross-lingual adhoc information retrieval systems today are nearly as effective as monolingual systems. Other related information access tasks, such as media monitoring, information filtering and routing, sentiment analysis, and information extraction require more sophisticated models and typically more processing and analysis of the information items of interest. Much of that processing needs to be aware of the specifics of the target languages it is deployed in. Mostly, the various mechanisms of variation in human language pose coverage challenges for information retrieval systems: texts in a collection may treat a topic of interest but use terms or expressions which do not match the expression of information need given by the user. This can be true even in a mono-lingual case, but this is especially true in cross-lingual information retrieval, where users may know the target language only to some extent. The benefits of CLIR technology for users with poor to moderate competence in the target language has been found to be greater than for those who are fluent. Specific technologies in place for CLIR services include morphological analysis to handle inflection, decompounding or compound splitting to handle compound terms, and translations mechanisms to translate a query from one language to another. The first workshop on CLIR was held in Zürich during the SIGIR-96 conference. Workshops have been held yearly since 2000 at the meetings of the Cross Language Evaluation Forum (CLEF). Researchers also convene at the annual Text Retrieval Conference (TREC) to discuss their findings regarding different systems and methods of information retrieval, and the conference has served as a point of reference for the CLIR subfield. Early CLIR experiments were conducted at TREC-6, held at the National Institute of Standards and Technology (NIST) on November 19–21, 1997. Google Search had a cross-language search feature that was removed in 2013.

    Read more →
  • Image destriping

    Image destriping

    Image destriping is the process of removing stripes or streaks from images and videos without disrupting the original image/video. These artifacts plague a range of fields in scientific imaging including atomic force microscopy, light sheet fluorescence microscopy, and planetary satellite imaging. The most common image processing techniques to reduce stripe artifacts is with Fourier filtering. Unfortunately, filtering methods risk altering or suppressing useful image data. Methods developed for multiple-sensor imaging systems in planetary satellites use statistical-based methods to match signal distribution across multiple sensors. More recently, a new class of approaches leverage compressed sensing, to regularize an optimization problem, and recover stripe free images. In many cases, these destriped images have little to no artifacts, even at low signal to noise ratios.

    Read more →
  • Legal information retrieval

    Legal information retrieval

    Legal information retrieval is the science of information retrieval applied to legal text, including legislation, case law, and scholarly works. Accurate legal information retrieval is important to provide access to the law to laymen and legal professionals. Its importance has increased because of the vast and quickly increasing amount of legal documents available through electronic means. Legal information retrieval is a part of the growing field of legal informatics. In a legal setting, it is frequently important to retrieve all information related to a specific query. However, commonly used boolean search methods (exact matches of specified terms) on full text legal documents have been shown to have an average recall rate as low as 20 percent, meaning that only 1 in 5 relevant documents are actually retrieved. In that case, researchers believed that they had retrieved over 75% of relevant documents. This may result in failing to retrieve important or precedential cases. In some jurisdictions this may be especially problematic, as legal professionals are ethically obligated to be reasonably informed as to relevant legal documents. Legal Information Retrieval attempts to increase the effectiveness of legal searches by increasing the number of relevant documents (providing a high recall rate) and reducing the number of irrelevant documents (a high precision rate). This is a difficult task, as the legal field is prone to jargon, polysemes (words that have different meanings when used in a legal context), and constant change. Techniques used to achieve these goals generally fall into three categories: boolean retrieval, manual classification of legal text, and natural language processing of legal text. == Problems == Application of standard information retrieval techniques to legal text can be more difficult than application in other subjects. One key problem is that the law rarely has an inherent taxonomy. Instead, the law is generally filled with open-ended terms, which may change over time. This can be especially true in common law countries, where each decided case can subtly change the meaning of a certain word or phrase. Legal information systems must also be programmed to deal with law-specific words and phrases. Though this is less problematic in the context of words which exist solely in law, legal texts also frequently use polysemes, words may have different meanings when used in a legal or common-speech manner, potentially both within the same document. The legal meanings may be dependent on the area of law in which it is applied. For example, in the context of European Union legislation, the term "worker" has four different meanings: Any worker as defined in Article 3(a) of Directive 89/391/EEC who habitually uses display screen equipment as a significant part of his normal work. Any person employed by an employer, including trainees and apprentices but excluding domestic servants; Any person carrying out an occupation on board a vessel, including trainees and apprentices, but excluding port pilots and shore personnel carrying out work on board a vessel at the quayside; Any person who, in the Member State concerned, is protected as an employee under national employment law and in accordance with national practice; It also has the common meaning: A person who works at a specific occupation. Though the terms may be similar, correct information retrieval must differentiate between the intended use and irrelevant uses in order to return the correct results. Even if a system overcomes the language problems inherent in law, it must still determine the relevancy of each result. In the context of judicial decisions, this requires determining the precedential value of the case. Case decisions from senior or superior courts may be more relevant than those from lower courts, even where the lower court's decision contains more discussion of the relevant facts. The opposite may be true, however, if the senior court has only a minor discussion of the topic (for example, if it is a secondary consideration in the case). An information retrieval system must also be aware of the authority of the jurisdiction. A case from a binding authority is most likely of more value than one from a non-binding authority. Additionally, the intentions of the user may determine which cases they find valuable. For instance, where a legal professional is attempting to argue a specific interpretation of law, he might find a minor court's decision which supports his position more valuable than a senior courts position which does not. He may also value similar positions from different areas of law, different jurisdictions, or dissenting opinions. Overcoming these problems can be made more difficult because of the large number of cases available. The number of legal cases available via electronic means is constantly increasing (in 2003, US appellate courts handed down approximately 500 new cases per day), meaning that an accurate legal information retrieval system must incorporate methods of both sorting past data and managing new data. == Techniques == === Boolean searches === Boolean searches, where a user may specify terms such as use of specific words or judgments by a specific court, are the most common type of search available via legal information retrieval systems. They are widely implemented but overcome few of the problems discussed above. The recall and precision rates of these searches vary depending on the implementation and searches analyzed. One study found a basic boolean search's recall rate to be roughly 20%, and its precision rate to be roughly 79%. Another study implemented a generic search (that is, not designed for legal uses) and found a recall rate of 56% and a precision rate of 72% among legal professionals. Both numbers increased when searches were run by non-legal professionals, to a 68% recall rate and 77% precision rate. This is likely explained because of the use of complex legal terms by the legal professionals. === Manual classification === In order to overcome the limits of basic boolean searches, information systems have attempted to classify case laws and statutes into more computer friendly structures. Usually, this results in the creation of an ontology to classify the texts, based on the way a legal professional might think about them. These attempt to link texts on the basis of their type, their value, and/or their topic areas. Most major legal search providers now implement some sort of classification search, such as Westlaw's “Natural Language” or LexisNexis' Headnote searches. Additionally, both of these services allow browsing of their classifications, via Westlaw's West Key Numbers or Lexis' Headnotes. Though these two search algorithms are proprietary and secret, it is known that they employ manual classification of text (though this may be computer-assisted). These systems can help overcome the majority of problems inherent in legal information retrieval systems, in that manual classification has the greatest chances of identifying landmark cases and understanding the issues that arise in the text. In one study, ontological searching resulted in a precision rate of 82% and a recall rate of 97% among legal professionals. The legal texts included, however, were carefully controlled to just a few areas of law in a specific jurisdiction. The major drawback to this approach is the requirement of using highly skilled legal professionals and large amounts of time to classify texts. As the amount of text available continues to increase, some have stated their belief that manual classification is unsustainable. === Natural language processing === In order to reduce the reliance on legal professionals and the amount of time needed, efforts have been made to create a system to automatically classify legal text and queries. Adequate translation of both would allow accurate information retrieval without the high cost of human classification. These automatic systems generally employ Natural Language Processing (NLP) techniques that are adapted to the legal domain, and also require the creation of a legal ontology. Though multiple systems have been postulated, few have reported results. One system, “SMILE,” which attempted to automatically extract classifications from case texts, resulted in an f-measure (which is a calculation of both recall rate and precision) of under 0.3 (compared to perfect f-measure of 1.0). This is probably much lower than an acceptable rate for general usage. Despite the limited results, many theorists predict that the evolution of such systems will eventually replace manual classification systems. === Citation-Based ranking === In the mid-90s the Room 5 case law retrieval project used citation mining for summaries and ranked its search results based on citation type and count. This slightly pre-dated the PageRank algorithm at Stanford which was also a citation-based ranking. Ranking of results was based

    Read more →
  • Singularity studies

    Singularity studies

    Singularity studies is an interdisciplinary academic field which examines the idea of technological singularity — the hypothesised point at which artificial intelligence may surpass human intelligence, might be attained by artificial intelligence (AI), robotics, and other technologies and sciences, and its social impacts. In this academic field, the study and research are conducted across a broad array of terrains such as information science, robotics, social informatics, economics, philosophy, and ethics. The primary aim of singularity studies is to gain an integrative understanding of the transformation of social systems occurring in tandem with the explosive evolution of AI and also the changes to be effected by such transformation in the view of humans, ethics, and legal systems. == History == An academic work on technological singurality has appeared in computer science, philosophy, sociology, and law since the early 1990s. Early discussions of an intelligence explosion were popularised by science-fiction writer Vernor Vinge in 1993 and later systematised by futurist Ray Kurzweil. Since the 2010s, universities such as Oxford, Stanford, and Keio have established dedicated programmes, while peer-reviewed journals have begun to publish scenario analyses and policy studies. Ongoing debates question the predictive value of singularity scenarios and warn against a deterministic view of technology. == Characteristics of research == Singularity studies extends beyond mere future predictions and offer an intellectual foundation for proactively designing and creating a desirable future. Principal research themes in this realm include: Ethics of AI; Social implications of technologies; Possibility of harmonious coexistence of humans and AI; Communication with AI; and Redesign of social systems. == Technologists and academics == Vernor Vinge: Propounded the concept of singularity in 1993, making a massive impact on the academic and science-fiction spheres. Ray Kurzweil: Predicted the advent around 2045 of the technological singularity in his 2005 book The Singularity Is Near. Nick Bostrom: Offered philosophical reflections on superintelligence and the risks posed by AI. He is the founding director of the now-dissolved Future of Humanity Institute at the University of Oxford. === Japan === Kento Sasano: A social informatician, AI educator, and inventor. He is the president of the Japan Society of Singularity Studies. == Challenges and outlook == Singularity studies is still evolving as an academic field, and quite a few challenges remain unresolved in regard to the systematization of their theories, research methods, and educational curricula. That said, in this day and age of accelerating technological and societal shifts, interdisciplinary approaches have gained in importance and are drawing much attention in the arenas of scholarly research, intercorporate collaboration, and policy planning.

    Read more →
  • Sentence embedding

    Sentence embedding

    In natural language processing, a sentence embedding is a representation of a sentence as a vector of numbers which encodes meaningful semantic information. State of the art embeddings are based on the learned hidden layer representation of dedicated sentence transformer models. BERT pioneered an approach involving the use of a dedicated [CLS] token prepended to the beginning of each sentence inputted into the model; the final hidden state vector of this token encodes information about the sentence and can be fine-tuned for use in sentence classification tasks. In practice however, BERT's sentence embedding with the [CLS] token achieves poor performance, often worse than simply averaging non-contextual word embeddings. SBERT later achieved superior sentence embedding performance by fine tuning BERT's [CLS] token embeddings through the usage of a siamese neural network architecture on the SNLI dataset. Other approaches are loosely based on the idea of distributional semantics applied to sentences. Skip-Thought trains an encoder-decoder structure for the task of neighboring sentences predictions; this has been shown to achieve worse performance than approaches such as InferSent or SBERT. An alternative direction is to aggregate word embeddings, such as those returned by Word2vec, into sentence embeddings. The most straightforward approach is to simply compute the average of word vectors, known as continuous bag-of-words (CBOW). However, more elaborate solutions based on word vector quantization have also been proposed. One such approach is the vector of locally aggregated word embeddings (VLAWE), which demonstrated performance improvements in downstream text classification tasks. == Applications == In recent years, sentence embedding has seen a growing level of interest due to its applications in natural language queryable knowledge bases through the usage of vector indexing for semantic search. LangChain for instance utilizes sentence transformers for purposes of indexing documents. In particular, an indexing is generated by generating embeddings for chunks of documents and storing (document chunk, embedding) tuples. Then given a query in natural language, the embedding for the query can be generated. A top k similarity search algorithm is then used between the query embedding and the document chunk embeddings to retrieve the most relevant document chunks as context information for question answering tasks. This approach is also known formally as retrieval-augmented generation. Though not as predominant as BERTScore, sentence embeddings are commonly used for sentence similarity evaluation which sees common use for the task of optimizing a Large language model's generation parameters is often performed via comparing candidate sentences against reference sentences. By using the cosine-similarity of the sentence embeddings of candidate and reference sentences as the evaluation function, a grid-search algorithm can be utilized to automate hyperparameter optimization. == Evaluation == A way of testing sentence encodings is to apply them on Sentences Involving Compositional Knowledge (SICK) corpus for both entailment (SICK-E) and relatedness (SICK-R). In the best results are obtained using a BiLSTM network trained on the Stanford Natural Language Inference (SNLI) Corpus. The Pearson correlation coefficient for SICK-R is 0.885 and the result for SICK-E is 86.3. A slight improvement over previous scores is presented in: SICK-R: 0.888 and SICK-E: 87.8 using a concatenation of bidirectional Gated recurrent unit.

    Read more →
  • Language identification

    Language identification

    In natural language processing, language identification or language guessing is the problem of determining which natural language a given content is in. Computational approaches to this problem view it as a special case of text categorization, solved with various statistical methods. == Overview == === Logical approach === A common non-statistical intuitive approach (though highly uncertain) is to look for common letter combinations, or distinctive diacritics or punctuation. === Statistical approach === There are several statistical approaches to language identification. An older statistical method by Grefenstette was based on the frequency of short n-grams, which are often function morphemes. For example, "ing" is more common in English than in French, while the sequence "que" is more common in French. Given a new page found on the Web, one counts the number of occurrences of each such short sequence and picks the language whose frequency table it matches the most. One technique is to compare the compressibility of the text to the compressibility of texts in a set of known languages. This approach is known as mutual information based distance measure. The same technique can also be used to empirically construct family trees of languages which closely correspond to the trees constructed using historical methods. Mutual information based distance measure is essentially equivalent to more conventional model-based methods and is not generally considered to be either novel or better than simpler techniques. Another technique, as described by Cavnar and Trenkle (1994) and Dunning (1994) is to create a language n-gram model from a "training text" for each of the languages. These models can be based on characters (Cavnar and Trenkle) or encoded bytes (Dunning); in the latter, language identification and character encoding detection are integrated. Then, for any piece of text needing to be identified, a similar model is made, and that model is compared to each stored language model. The most likely language is the one with the model that is most similar to the model from the text needing to be identified. This approach can be problematic when the input text is in a language for which there is no model. In that case, the method may return another, "most similar" language as its result. Also problematic for any approach are pieces of input text that are composed of several languages, as is common on the Web. As of 2025, a commonly used baseline method is via the fastText library, which has comparable classification accuracy as deep learning techniques, but much faster. == Identifying similar languages == One of the great bottlenecks of language identification systems is to distinguish between closely related languages. Similar languages like Bulgarian and Macedonian or Indonesian and Malay present significant lexical and structural overlap, making it challenging for systems to discriminate between them. In 2014 the DSL shared task has been organized providing a dataset (Tan et al., 2014) containing 13 different languages (and language varieties) in six language groups: Group A (Bosnian, Croatian, Serbian), Group B (Indonesian, Malaysian), Group C (Czech, Slovak), Group D (Brazilian Portuguese, European Portuguese), Group E (Peninsular Spanish, Argentine Spanish), Group F (American English, British English). The best system reached performance of over 95% results (Goutte et al., 2014). Results of the DSL shared task are described in Zampieri et al. 2014. == Software == Apache OpenNLP includes char n-gram based statistical detector and comes with a model that can distinguish 103 languages Apache Tika contains a language detector for 18 languages

    Read more →
  • CLAWS (linguistics)

    CLAWS (linguistics)

    The Constituent Likelihood Automatic Word-tagging System (CLAWS) is a program that performs part-of-speech tagging. It was developed in the 1980s at Lancaster University by the University Centre for Computer Corpus Research on Language. It has an overall accuracy rate of 96–97% with the latest version (CLAWS4) tagging around 100 million words of the British National Corpus. == History == A Part-Of-Speech Tagger (POS Tagger) is a piece of software that reads text in some language and assigns parts of speech to each word (and other token), such as noun, verb, adjective, etc., although generally computational applications use more fine-grained POS tags like 'noun-plural'. Developed in the early 1980s, CLAWS was built to fill the ever-growing gap created by always-changing POS necessities. Originally created to add part-of-speech tags to the LOB corpus of British English, the CLAWS tagset has since been adapted to other languages as well, including Urdu and Arabic. Since its inception, CLAWS has been hailed for its functionality and adaptability. Still, it is not without flaws, and though it boasts an error-rate of only 1.5% when judged in major categories, CLAWS still remains with c.3.3% ambiguities unresolved. Ambiguity arises in cases such as with the word flies, and whether it should be classified as a noun or a verb. It's these ambiguities that will require the various upgrades and tagsets that CLAWS will endure. == Rules and processing == CLAWS uses a Hidden Markov model to determine the likelihood of sequences of words in anticipating each part-of-speech label. === Sample output === This excerpt from Bram Stoker's Dracula (1897) has been tagged using both the CLAWS C5 and C7 tagsets. This is what a CLAWS output will generally look like, with the most likely part-of-speech tag following each word. == Tagsets == === CLAWS1 tagset === The first tagset developed in CLAWS, CLAWS1 tagset, has 132 word tags. In terms of form and application, C1 tagset is similar to Brown Corpus tags. See Table of tags in C1 tagset here. === CLAWS2 tagset === From 1983 to 1986, updated versions leading to CLAWS2 were part of a larger attempt to deal with aspects such as recognizing sentence breaks, in order to avoid the need for manual pre-processing of a text before the tags were applied, moving instead to optional manual post-editing to adjust the output of the automatic annotation, if needed. The CLAWS2 tagset has 166 word tags. See Table of tags in C2 tagset here. === CLAWS4 tagset === The CLAWS4 was used for the 100-million-word British National Corpus (BNC). A general-purpose grammatical tagger, it is a successor of the CLAWS1 tagger. In tagging the BNC, the many rounds of work that went into CLAWS4 focused on making the CLAWS program independent from the tagsets. For example, the BNC project used two tagset versions: "a main tagset (C5) with 62 tags with which the whole of the corpus has been tagged, and a larger (C7) tagset with 152 tags, which has been used to make a selected 'core' sample corpus of two million words." The latest version of CLAWS4 is offered by UCREL, a research center of Lancaster University. === CLAWS5 tagset === The CLAWS5 tagset, which was used for BNC, has over 60 tags. See Table of tags in C5 tagset here. === CLAWS6 tagset === The CLAWS6 tagset was used for the BNC sampler corpus and the COLT corpus. It has over 160 tags, including 13 determiner subtypes. See Table of tags in C6 tagset here. === CLAWS7 tagset === The standard CLAWS7 tagset is used currently. It is only different in the punctuation tags when compared to the CLAWS6 tagset. See Table of tags in C7 tagset here. === CLAWS8 tagset === CLAWS8 tagset was extended from C7 tagset with further distinctions in the determiner and pronoun categories, as well as 37 new auxiliary tags for forms of be, do, and have. See Table of tags in C8 tagset here

    Read more →
  • Software component

    Software component

    A software component is a modular unit of software that encapsulates specific functionality. The desired characteristics of a component are reusability and maintainability. == Value == Components allow software developers to assemble software with reliable parts rather than writing code for every aspect. It makes implementation more like factory assembly than custom building. == Attributes == Desirable attributes of a component include but are not limited to: Cohesive – encapsulates related functionality Reusable Robust Substitutable – can be replaced by another component with the same interface Documented Tested == Third-party == Some components are built in-house by the same organization or team building the software system. Some are third-party, developed elsewhere and assembled into the software system. == Component-based software engineering == For large-scale systems, component-based development encourages a disciplined process to manage complexity. == Framework == Some components conform to a framework technology that allows them to be consumed in a well-known way. Examples include: CORBA, COM, Enterprise JavaBeans, and the .NET Framework. == Modeling == Component design is often modeled visually. In Unified Modeling Language (UML) 2.0 a component is shown as a rectangle, and an interface is shown as a lollipop to indicate a provided interface and as a socket to indicate consumption of an interface. == History == The idea of reusable software components was promoted by Douglas McIlroy in his presentation at the NATO Software Engineering Conference of 1968. (One goal of that conference was to resolve the so-called software crisis of the time.) In the 1970s, McIlroy put this idea into practice with the addition of the pipeline feature to the Unix operating system. Brad Cox refined the concept of a software component in the 1980s. He attempted to create an infrastructure and market for reusable third-party components by inventing the Objective-C programming language. IBM introduced System Object Model (SOM) in the early 1990s. Microsoft introduced Component Object Model (COM) in the early 1990s. Microsoft built many domain-specific component technologies on COM, including Distributed Component Object Model (DCOM), Object Linking and Embedding (OLE), and ActiveX.

    Read more →
  • Healthy Together

    Healthy Together

    Healthy Together is a health technology company that provides software for Health & Humans Services Departments. Healthy Together supports a “One Door” approach to eligibility, enrollment, and management for programs like Medicaid, Supplemental Nutrition Assistance Program, TANF and WIC, as well as behavioral health (988), disease surveillance, vital records, child welfare and more. The platform's use is to increase the reach and efficacy of program initiatives, improve health equity and reduce cost. Software is available in the United States of America with current deployments in Florida, Oklahoma. The United States Department of Veterans Affairs also utilizes Healthy Together's mobile platform. == Development == Healthy Together launched in March 2020 and builds software for public health and health and human services departments. The Florida Department of Health began using the platform in September 2020 to deliver real-time test results to residents. Over 50% of households in Florida have adopted the mobile application. On December 6, 2022, the Advanced Technology Academic Research Center (ATARC) awarded Healthy Together and the State of Florida's Department of Health with a Digital Experience Award at their 2022 GITEC Emerging Technology Award Ceremony in Washington, D.C. to recognize success of the project. The partnership was also highlighted on the Federal News Network's show Federal Drive. The platform is also used at universities in Oklahoma. In November 2022, the United States Department of Veterans Affairs and Healthy Together announced a collaboration to expand access to health records for Veterans. The platform provides 18 million Veterans with access to their health information through their smartphones and mobile devices. In December 2022, the integration was recognized as one of Healthcare IT News' Top 10 stories of 2022.

    Read more →
  • Automatic acquisition of lexicon

    Automatic acquisition of lexicon

    Automatic acquisition of lexicon is a computerized process used for the development of a complex morphological lexicon of a language. The lexicon is essential for the NLP (Natural language processing), as well as a prerequisite to any wide-coverage parser. The two main requirements represent raw corpus and the morphological description of the language. The aim is to provide lemmas that will serve to the explanation of all the words that occur within the corpus. For the achievement of a quality lexicon it is necessary to manually validate the generated lemmas and iterate the whole process several times. The process is focused on the open word classes (e.g. nouns, adjectives, verbs). Closed classes (e.g. prepositions, pronouns, numerals) are excluded. This method is applicable to the languages with a rich morphology, such as Slovak, Russian or Croatian. Applied to Slovak, being an inflectional language, the automatic acquisition focuses on the inflectional morphology as well as on the derivational morphology. This fact enables the users to find out the information about derivational relations (e.g. adjectivizations, prefixes) in the lexicon. For example, Slovak word korpusový is an adjectivization of korpus (eng. corpus). == Three-step loop == Conformably to Benoît Sagot, there are three stages involved in the acquisition of lemmas: Generation and inflection Ranking Manual validation The more iteration will be performed, the more accurate lexicon will be obtained. For each iteration are essential the information given by a manual validator. === Generation and inflection === Firstly, all words which represent the closed word classes (pronouns, prepositions, numerals) are manually excluded from the given corpus. Number of their occurrences in the corpus is provided. Then the automatic generation comes, when the hypothetical lemmas according to the morphological description of a language are created. Generated lemmas are consequently being inflected, so that all of their inflected forms are built. Obtained forms are associated with the corresponding lemma and a morphological tag. === Ranking === There was created a probabilistic model, represented by a fix-point algorithm, to rank the hypothetical lemmas generated in the first step. Best ranked lemmas are expected to be ideally all correct, whereas the least ranked tend to be incorrect. === Manual validation === Correctness of the best- ranked lemmas created in the previous step are checked by the manual validator, who should be a native speaker. Lemmas are at this stage divided into three categories: valid lemmas, appended to lexicon erroneous lemmas generated by valid forms (later associated to another lemmas) erroneous lemmas generated by invalid forms (these need to be excluded) == Future development == Automatic acquisition, in comparison to a purely manual development of the lexicons, seems to be promising, considering the future development, because of the short validation time needed and the relatively small amount of human labor involved.

    Read more →
  • N-jet

    N-jet

    An N-jet is the set of (partial) derivatives of a function f ( x ) {\displaystyle f(x)} up to order N. Specifically, in the area of computer vision, the N-jet is usually computed from a scale space representation L {\displaystyle L} of the input image f ( x , y ) {\displaystyle f(x,y)} , and the partial derivatives of L {\displaystyle L} are used as a basis for expressing various types of visual modules. For example, algorithms for tasks such as feature detection, feature classification, stereo matching, tracking and object recognition can be expressed in terms of N-jets computed at one or several scales in scale space.

    Read more →
  • TensorFlow Hub

    TensorFlow Hub

    TensorFlow Hub (also styled TF Hub) is an open-source machine learning library and online repository that provides TensorFlow model components, called modules. It is maintained by Google as part of the TensorFlow ecosystem and allows developers to discover, publish, and reuse pretrained models for tasks such as computer vision, natural language processing, and transfer learning. == Overview == TensorFlow Hub provides a central platform where developers and researchers can access pre-trained models and integrate them directly into TensorFlow workflows. Each module encapsulates a computation graph and its trained weights, with standardized input and output signatures. Modules can be loaded using the hub.load() function or through Keras integration via hub.KerasLayer, enabling users to perform transfer learning or feature extraction. == History == TensorFlow Hub was announced by Google in March 2018, with the first public version released shortly after. Its introduction coincided with the growing adoption of transfer learning techniques and the need for standardized model packaging. Over time, the hub expanded to include models such as the BERT family, MobileNet, EfficientNet, and the Universal Sentence Encoder. In 2020, research on “Regret selection in TensorFlow Hub” explored the problem of identifying optimal models for downstream tasks given a large repository of alternatives. == Applications == TensorFlow Hub hosts a variety of models across machine learning domains: Natural language processing: BERT, ALBERT language model, and Universal Sentence Encoder. Computer vision: ResNet, Inception (deep learning), MobileNet, EfficientNet. Speech and audio: spectrogram feature extractors and automatic speech recognition models. Multilingual embeddings: cross-lingual and sentence-level representations for machine translation and semantic similarity. Modules are widely used in education, academic research, and industry for prototyping and production deployment.

    Read more →
  • Pill reminder

    Pill reminder

    A pill reminder is any device that reminds users to take medications. Traditional pill reminders are pill containers with electric timers attached, which can be preset for certain times of the day to set off an alarm. More sophisticated pill reminders can also detect when they have been opened, and therefore when the user is away during the time they were supposed to take their medication, they will be reminded of it when they return. This reminder can be in the form of a light, which also helps for deaf or hearing-impaired users. == Mobile app == A newer type of pill reminder is a mobile app that reminds the owner to take the medication. Some of these applications might effectively support adherence to taking medications.

    Read more →
  • XLNet

    XLNet

    The XLNet was an autoregressive Transformer designed as an improvement over BERT, with 340M parameters and trained on 33 billion words. It was released on 19 June 2019, under the Apache 2.0 license. It achieved state-of-the-art results on a variety of natural language processing tasks, including language modeling, question answering, and natural language inference. == Architecture == The main idea of XLNet is to model language autoregressively like the GPT models, but allow for all possible permutations of a sentence. Concretely, consider the following sentence:My dog is cute.In standard autoregressive language modeling, the model would be tasked with predicting the probability of each word, conditioned on the previous words as its context: We factorize the joint probability of a sequence of words x 1 , … , x T {\displaystyle x_{1},\ldots ,x_{T}} using the chain rule: Pr ( x 1 , … , x T ) = Pr ( x 1 ) Pr ( x 2 | x 1 ) Pr ( x 3 | x 1 , x 2 ) … Pr ( x T | x 1 , … , x T − 1 ) . {\displaystyle \Pr(x_{1},\ldots ,x_{T})=\Pr(x_{1})\Pr(x_{2}|x_{1})\Pr(x_{3}|x_{1},x_{2})\ldots \Pr(x_{T}|x_{1},\ldots ,x_{T-1}).} For example, the sentence "My dog is cute" is factorized as: Pr ( My , dog , is , cute ) = Pr ( My ) Pr ( dog | My ) Pr ( is | My , dog ) Pr ( cute | My , dog , is ) . {\displaystyle \Pr({\text{My}},{\text{dog}},{\text{is}},{\text{cute}})=\Pr({\text{My}})\Pr({\text{dog}}|{\text{My}})\Pr({\text{is}}|{\text{My}},{\text{dog}})\Pr({\text{cute}}|{\text{My}},{\text{dog}},{\text{is}}).} Schematically, we can write it as → My → My dog → My dog is → My dog is cute . {\displaystyle {\texttt {}}{\texttt {}}{\texttt {}}{\texttt {}}\to {\text{My }}{\texttt {}}{\texttt {}}{\texttt {}}\to {\text{My dog }}{\texttt {}}{\texttt {}}\to {\text{My dog is }}{\texttt {}}\to {\text{My dog is cute}}.} However, for XLNet, the model is required to predict the words in a randomly generated order. Suppose we have sampled a randomly generated order 3241, then schematically, the model is required to perform the following prediction task: is dog is dog is cute → My dog is cute {\displaystyle {\texttt {}}{\texttt {}}{\texttt {}}{\texttt {}}\to {\texttt {}}{\texttt {}}{\text{is }}{\texttt {}}\to {\texttt {}}{\text{dog is }}{\texttt {}}\to {\texttt {}}{\text{dog is cute}}\to {\text{My dog is cute}}} By considering all permutations, XLNet is able to capture longer-range dependencies and better model the bidirectional context of words. === Two-Stream Self-Attention === To implement permutation language modeling, XLNet uses a two-stream self-attention mechanism. The two streams are: Content stream: This stream encodes the content of each word, as in standard causally masked self-attention. Query stream: This stream encodes the content of each word in the context of what has gone before. In more detail, it is a masked cross-attention mechanism, where the queries are from the query stream, and the key-value pairs are from the content stream. The content stream uses the causal mask M causal = [ 0 − ∞ − ∞ … − ∞ 0 0 − ∞ … − ∞ 0 0 0 … − ∞ ⋮ ⋮ ⋮ ⋱ ⋮ 0 0 0 … 0 ] {\displaystyle M_{\text{causal}}={\begin{bmatrix}0&-\infty &-\infty &\dots &-\infty \\0&0&-\infty &\dots &-\infty \\0&0&0&\dots &-\infty \\\vdots &\vdots &\vdots &\ddots &\vdots \\0&0&0&\dots &0\end{bmatrix}}} permuted by a random permutation matrix to P M causal P − 1 {\displaystyle PM_{\text{causal}}P^{-1}} . The query stream uses the cross-attention mask P ( M causal − ∞ I ) P − 1 {\displaystyle P(M_{\text{causal}}-\infty I)P^{-1}} , where the diagonal is subtracted away specifically to avoid the model "cheating" by looking at the content stream for what the current masked token is. Like the causal masking for GPT models, this two-stream masked architecture allows the model to train on all tokens in one forward pass. == Training == Two models were released: XLNet-Large, cased: 110M parameters, 24-layer, 1024-hidden, 16-heads XLNet-Base, cased: 340M parameters, 12-layer, 768-hidden, 12-heads. It was trained on a dataset that amounted to 32.89 billion tokens after tokenization with SentencePiece. The dataset was composed of BooksCorpus, and English Wikipedia, Giga5, ClueWeb 2012-B, and Common Crawl. It was trained on 512 TPU v3 chips, for 5.5 days. At the end of training, it still under-fitted the data, meaning it could have achieved lower loss with more training. It took 0.5 million steps with an Adam optimizer, linear learning rate decay, and a batch size of 8192.

    Read more →