Spark NLP

Spark NLP

Spark NLP is an open-source text processing library for advanced natural language processing for the Python, Java and Scala programming languages. The library is built on top of Apache Spark and its Spark ML library. Its purpose is to provide an API for natural language processing pipelines that implement recent academic research results as production-grade, scalable, and trainable software. The library offers pre-trained neural network models, pipelines, and embeddings, as well as support for training custom models. == Features == The design of the library makes use of the concept of a pipeline which is an ordered set of text annotators. Out of the box annotators include, tokenizer, normalizer, stemming, lemmatizer, regular expression, TextMatcher, chunker, DateMatcher, SentenceDetector, DeepSentenceDetector, POS tagger, ViveknSentimentDetector, sentiment analysis, named entity recognition, conditional random field annotator, deep learning annotator, spell checking and correction, dependency parser, typed dependency parser, document classification, and language detection. The Models Hub is a platform for sharing open-source as well as licensed pre-trained models and pipelines. It includes pre-trained pipelines with tokenization, lemmatization, part-of-speech tagging, and named entity recognition that exist for more than thirteen languages; word embeddings including GloVe, ELMo, BERT, ALBERT, XLNet, Small BERT, and ELECTRA; sentence embeddings including Universal Sentence Embeddings (USE) and Language Agnostic BERT Sentence Embeddings (LaBSE). It also includes resources and pre-trained models for more than two hundred languages. Spark NLP base code includes support for East Asian languages such as tokenizers for Chinese, Japanese, Korean; for right-to-left languages such as Urdu, Farsi, Arabic, Hebrew and pre-trained multilingual word and sentence embeddings such as LaUSE and a translation annotator. == Usage in healthcare == Spark NLP for Healthcare is a commercial extension of Spark NLP for clinical and biomedical text mining. It provides healthcare-specific annotators, pipelines, models, and embeddings for clinical entity recognition, clinical entity linking, entity normalization, assertion status detection, de-identification, relation extraction, and spell checking and correction. The library offers access to several clinical and biomedical transformers: JSL-BERT-Clinical, BioBERT, ClinicalBERT, GloVe-Med, GloVe-ICD-O. It also includes over 50 pre-trained healthcare models, that can recognize the entities such as clinical, drugs, risk factors, anatomy, demographics, and sensitive data. == Spark OCR == Spark OCR is another commercial extension of Spark NLP for optical character recognition (OCR) from images, scanned PDF documents, and DICOM files. It is a software library built on top of Apache Spark. It provides several image pre-processing features for improving text recognition results such as adaptive thresholding and denoising, skew detection & correction, adaptive scaling, layout analysis and region detection, image cropping, removing background objects. Due to the tight coupling between Spark OCR and Spark NLP, users can combine NLP and OCR pipelines for tasks such as extracting text from images, extracting data from tables, recognizing and highlighting named entities in PDF documents or masking sensitive text in order to de-identify images. Several output formats are supported by Spark OCR such as PDF, images, or DICOM files with annotated or masked entities, digital text for downstream processing in Spark NLP or other libraries, structured data formats (JSON and CSV), as files or Spark data frames. Users can also distribute the OCR jobs across multiple nodes in a Spark cluster. == License and availability == Spark NLP is licensed under the Apache 2.0 license. The source code is publicly available on GitHub as well as documentation and a tutorial. Prebuilt versions of Spark NLP are available in PyPi and Anaconda Repository for Python development, in Maven Central for Java & Scala development, and in Spark Packages for Spark development. == Award == In March 2019, Spark NLP received Open Source Award for its contributions in natural language processing in Python, Java, and Scala.

Auto-defrost

Auto-defrost, automatic defrost or self-defrosting is a technique which regularly defrosts the evaporator in a refrigerator or freezer. Appliances using this technique are often called frost free, frostless, or no-frost. == Mechanism == The defrost mechanism in a refrigerator heats the cooling element (evaporator coil) for a short period of time and melts the frost that has formed on it. The resulting water drains through a duct at the back of the unit. Defrosting is controlled by an electric or electronic timer. For every 6, 8, 10, 12 or 24 hours of compressor operation, it turns on a defrost heater for 15 minutes to half an hour. The defrost heater, having a typical power rating of 350W to 600W, is often mounted just below the evaporator in top and bottom-freezer models. It can also be located below and in the middle of the evaporator in side-by-side models. It may be protected from short circuits by means of fusible links. In older refrigerators, the timer runs continuously. In newer designs, the timer only runs while the compressor runs, so the longer the refrigerator door is closed, the less time the heater will run for and the more energy is saved. A defrost thermostat opens the heater circuit when the evaporator temperature rises above a preset temperature, 40°F (5°C) or more, thereby preventing excessive heating of the freezer compartment. The defrost timer is such that either the compressor or the defrost heater is on, but not both at the same time. Inside the freezer, air is circulated by means of one or more fans. In a typical design cold air from the freezer compartment is ducted to the fresh food compartment and circulated back into the freezer compartment. Air circulation helps sublimate any ice or frost that may form on frozen items in the freezer compartment. While defrosting, this fan is stopped to prevent heated-up air from reaching the food compartment. Instead of the normal cooling elements being embedded in the freezer liner, auto-defrost elements are behind or beneath the liner. This allows them to be heated for short periods of time to dispose of frost, without heating the contents of the freezer. Alternatively, some systems use the hot gas in the condenser to defrost the evaporator. This is done by means of a circuit that is cross-linked by a three-way valve. The hot gas quickly heats up the evaporator and defrosts it. This system is primarily used in commercial applications such as ice-cream displays. == Application == While this technique was originally applied to the refrigerator compartment, it was later used for freezer compartment as well. A combined refrigerator/freezer which applies self-defrosting to the refrigerator compartment only is usually called "partial frost free" or semi-automatic defrost (some brands call these "Auto Defrost" while Frigidaire referred to their semi-automatic models as "Cycla-Matic," Kelvinator often named these models as "Cyclic Defrost" ). These refrigerators usually have a pan underneath where water from the melted frost in the refrigerator section evaporates. Freezers with automatic defrosting and combined refrigerator/freezer units which also apply self defrosting to their freezer compartment are called "frost free". The latter usually feature an air connection between the two compartments with the air passage to the refrigerator compartment regulated by a damper. By this means, a controlled portion of the air coming from the freezer reaches the refrigerator. Some older models have no air circulation between their freezer and refrigerator sections. Instead, they use an independent cooling system (for example: an evaporator coil with a defrost heater and a circulating fan in the freezer and a cold-plate or open-coil evaporator in the refrigerator. "Frost-Free" refrigerator/freezer units usually use a heating element to defrost their evaporators, a pan to collect and evaporate water from the frost that melts from the cold plate and/or evaporator coil, a timer which turns off the compressor and turns on the defrost element usually from once to 4 times a day for periods usually ranging from 15 to 30 minutes, a defrost limiter thermostat that turns off the heating element before the temperature rises too much while the timer is still in its defrost phase. Some models also feature a drain heater to prevent ice from blocking the drain. Other early types of refrigerators also use hot gas defrost instead of electric heaters. These reverse the evaporator and condenser sides for the defrost cycle. Some newer refrigerator/freezer models have a computer that monitors how many times each door is opened and uses this data to control defrost scheduling thereby reducing power use. == Advantages == No need to manually defrost the frost buildup, therefore power consumption will not increase with time. Food packaging is easier to see. Most frozen food will not stick together. Smells are limited, especially in total frost-free appliances because the air always circulates. Better temperature management. == Disadvantages == The system can be more expensive to run when usage is high and if the fan continues or starts to run when the door is opened. A thermal cutout safety device is required to prevent overheating of the heating element. Increased electrical and mechanical complexity compared to a basic upright freezer or chest freezer, making it more prone to component failure. The temperature of the freezer contents rises during the defrosting cycles, especially if there is a light load in the freezer. This can cause "freezer burn" on articles placed in the freezer, from partially defrosting, then re-freezing On hot, humid days condensation will sometimes form around the refrigerator doors. Defrosting may not be completed by the time the defrost timer cycles back to normal operation (especially in hot, humid conditions with frequent door openings), leaving ice/frost on the evaporator coils. This condition can lead to "icing" which will interfere with the operation of the refrigerator. In laboratories, self-defrosting freezers must not be used to store certain delicate reagents such as enzymes, because the temperature cycling can degrade them. In addition, water can evaporate out of containers that do not have a very tight seal, altering the concentration of the reagents. Self-defrosting freezers should never be used to store flammable chemicals.

Visopsys

Visopsys (Visual Operating System), is an operating system, written by Andy McLaughlin. Development of the operating system began in 1997. The operating system is licensed under the GNU GPL, with the headers and libraries under the less restrictive LGPL license. It runs on the 32-bit IA-32 architecture. It features a multitasking kernel, supports asynchronous I/O and the FAT line of file systems. It requires a Pentium processor. == History == The development of Visopsys began in 1997, being written by Andy McLaughlin. The first public release of the Operating System was on 2 March 2001, with version 0.1. In this release, Visopsys was a 32 bit operating system, supporting preemptive multitasking and virtual memory. == System overview == Visopsys uses a monolithic kernel, written in the C programming language, with elements of assembly language for certain interactions with the hardware. The operating system supports a graphical user interface, with a small C library.

Supercomputer operating system

A supercomputer operating system is an operating system intended for supercomputers. Since the end of the 20th century, supercomputer operating systems have undergone major transformations, as fundamental changes have occurred in supercomputer architecture. While early operating systems were custom tailored to each supercomputer to gain speed, the trend has been moving away from in-house operating systems and toward some form of Linux, with it running all the supercomputers on the TOP500 list in November 2017. In 2021, top 10 computers run for instance Red Hat Enterprise Linux (RHEL), or some variant of it or other Linux distribution e.g. Ubuntu. Given that modern massively parallel supercomputers typically separate computations from other services by using multiple types of nodes, they usually run different operating systems on different nodes, e.g., using a small and efficient lightweight kernel such as Compute Node Kernel (CNK) or Compute Node Linux (CNL) on compute nodes, but a larger system such as a Linux distribution on server and input/output (I/O) nodes. While in a traditional multi-user computer system job scheduling is in effect a tasking problem for processing and peripheral resources, in a massively parallel system, the job management system needs to manage the allocation of both computational and communication resources, as well as gracefully dealing with inevitable hardware failures when tens of thousands of processors are present. Although most modern supercomputers use the Linux operating system, each manufacturer has made its own specific changes to the Linux distribution they use, and no industry standard exists, partly because the differences in hardware architectures require changes to optimize the operating system to each hardware design. == Context and overview == In the early days of supercomputing, the basic architectural concepts were evolving rapidly, and system software had to follow hardware innovations that usually took rapid turns. In the early systems, operating systems were custom tailored to each supercomputer to gain speed, yet in the rush to develop them, serious software quality challenges surfaced and in many cases the cost and complexity of system software development became as much an issue as that of hardware. In the 1980s the cost for software development at Cray came to equal what they spent on hardware and that trend was partly responsible for a move away from the in-house operating systems to the adaptation of generic software. The first wave in operating system changes came in the mid-1980s, as vendor specific operating systems were abandoned in favor of Unix. Despite early skepticism, this transition proved successful. By the early 1990s, major changes were occurring in supercomputing system software. By this time, the growing use of Unix had begun to change the way system software was viewed. The use of a high level language (C) to implement the operating system, and the reliance on standardized interfaces was in contrast to the assembly language oriented approaches of the past. As hardware vendors adapted Unix to their systems, new and useful features were added to Unix, e.g., fast file systems and tunable process schedulers. However, all the companies that adapted Unix made unique changes to it, rather than collaborating on an industry standard to create "Unix for supercomputers". This was partly because differences in their architectures required these changes to optimize Unix to each architecture. As general purpose operating systems became stable, supercomputers began to borrow and adapt critical system code from them, and relied on the rich set of secondary functions that came with them. However, at the same time the size of the code for general purpose operating systems was growing rapidly. By the time Unix-based code had reached 500,000 lines long, its maintenance and use was a challenge. This resulted in the move to use microkernels which used a minimal set of the operating system functions. Systems such as Mach at Carnegie Mellon University and ChorusOS at INRIA were examples of early microkernels. The separation of the operating system into separate components became necessary as supercomputers developed different types of nodes, e.g., compute nodes versus I/O nodes. Thus modern supercomputers usually run different operating systems on different nodes, e.g., using a small and efficient lightweight kernel such as CNK or CNL on compute nodes, but a larger system such as a Linux-derivative on server and I/O nodes. == Early systems == The CDC 6600, generally considered the first supercomputer in the world, ran the Chippewa Operating System, which was then deployed on various other CDC 6000 series computers. The Chippewa was a rather simple job control oriented system derived from the earlier CDC 3000, but it influenced the later KRONOS and SCOPE systems. The first Cray-1 was delivered to the Los Alamos Lab with no operating system, or any other software. Los Alamos developed the application software for it, and the operating system. The main timesharing system for the Cray 1, the Cray Time Sharing System (CTSS), was then developed at the Livermore Labs as a direct descendant of the Livermore Time Sharing System (LTSS) for the CDC 6600 operating system from twenty years earlier. In developing supercomputers, rising software costs soon became dominant, as evidenced by the 1980s cost for software development at Cray growing to equal their cost for hardware. That trend was partly responsible for a move away from the in-house Cray Operating System to UNICOS system based on Unix. In 1985, the Cray-2 was the first system to ship with the UNICOS operating system. Around the same time, the EOS operating system was developed by ETA Systems for use in their ETA10 supercomputers. Written in Cybil, a Pascal-like language from Control Data Corporation, EOS highlighted the stability problems in developing stable operating systems for supercomputers and eventually a Unix-like system was offered on the same machine. The lessons learned from developing ETA system software included the high level of risk associated with developing a new supercomputer operating system, and the advantages of using Unix with its large extant base of system software libraries. By the middle 1990s, despite the extant investment in older operating systems, the trend was toward the use of Unix-based systems, which also facilitated the use of interactive graphical user interfaces (GUIs) for scientific computing across multiple platforms. The move toward a commodity OS had opponents, who cited the fast pace and focus of Linux development as a major obstacle against adoption. As one author wrote "Linux will likely catch up, but we have large-scale systems now". Nevertheless, that trend continued to gain momentum and by 2005, virtually all supercomputers used some Unix-like OS. These variants of Unix included IBM AIX, the open source Linux system, and other adaptations such as UNICOS from Cray. By the end of the 20th century, Linux was estimated to command the highest share of the supercomputing pie. == Modern approaches == The IBM Blue Gene supercomputer uses the CNK operating system on the compute nodes, but uses a modified Linux-based kernel called I/O Node Kernel (INK) on the I/O nodes. CNK is a lightweight kernel that runs on each node and supports a single application running for a single user on that node. For the sake of efficient operation, the design of CNK was kept simple and minimal, with physical memory being statically mapped and the CNK neither needing nor providing scheduling or context switching. CNK does not even implement file I/O on the compute node, but delegates that to dedicated I/O nodes. However, given that on the Blue Gene multiple compute nodes share a single I/O node, the I/O node operating system does require multi-tasking, hence the selection of the Linux-based operating system. While in traditional multi-user computer systems and early supercomputers, job scheduling was in effect a task scheduling problem for processing and peripheral resources, in a massively parallel system, the job management system needs to manage the allocation of both computational and communication resources. It is essential to tune task scheduling, and the operating system, in different configurations of a supercomputer. A typical parallel job scheduler has a master scheduler which instructs some number of slave schedulers to launch, monitor, and control parallel jobs, and periodically receives reports from them about the status of job progress. Some, but not all supercomputer schedulers attempt to maintain locality of job execution. The PBS Pro scheduler used on the Cray XT3 and Cray XT4 systems does not attempt to optimize locality on its three-dimensional torus interconnect, but simply uses the first available processor. On the other hand, IBM's scheduler on the Blue Gene supercomputers aims to exploit locality a

Scalable Video Coding

Scalable Video Coding (SVC) is a video compression standard developed jointly by the ITU-T and the ISO/IEC. The two organizations formed the Joint Video Team (JVT) to create the H.264/MPEG-4 AVC standard (ITU-T Rec. H.264 | ISO/IEC 14496-10 AVC). SVC aims to provide adaptable or scalable content, allowing a single encoded video stream to be decoded at various bitrates, resolutions, and quality levels, thus catering to diverse devices and network conditions. == History == In October 2003, the Moving Picture Experts Group (MPEG) issued a Call for Proposals on SVC Technology. Fourteen proposals were submitted, twelve of which utilized wavelet compression, while the remaining two were extensions of H.264/MPEG-4 AVC. The proposal from the Heinrich-Hertz-Institut (HHI) was selected by MPEG as the foundation for the SVC standardization project. In January 2005, MPEG and the Video Coding Experts Group (VCEG) agreed to finalize SVC as an amendment to the H.264/MPEG-4 AVC standard. In November 2008, Google launched Gmail Video Chat, which employed an H.264/SVC codec, marking the first consumer application of the standard. This service was succeeded by Google+ Hangouts in 2012. In 2011, Google Code highlighted SVC as the successor to the open-source RVC video chat engine, noting its prominence in 2010. == Principles of scalability == === Overview === Scalability refers to the ability to represent a video signal at multiple levels of detail within a single encoded bitstream. This enables decoding of a base layer for basic quality and additional enhancement layers for progressively higher quality. SVC defines three types of scalability: Spatial scalability: Supports multiple resolution levels. Temporal scalability: Enables varying frame rates. Quality scalability: Provides different image quality levels. === Spatial scalability === Spatial scalability allows the reconstruction of video at different resolutions, such as QCIF, CIF, or SD. This is achieved through a pyramidal decomposition into multiple spatial layers. === Temporal scalability === Temporal scalability adjusts the frame rate of the decoded video stream. Various frame rates are supported using a hierarchical structure of video frames. === Quality scalability === Quality scalability, or Signal-to-Noise Ratio (SNR) scalability, improves the signal-to-noise ratio of a layer, reducing quantization distortion between the original and reconstructed images. SVC supports two approaches: Fine Grain Scalability (FGS) and Coarse Grain Scalability (CGS). ==== Coarse Grain Scalability (CGS) ==== CGS incorporates quality scalability across spatial resolutions. Each spatial resolution is encoded as a separate layer, refining texture and motion data. For a given resolution, quality scalability is achieved by encoding multiple quality layers with progressively finer quantization steps, starting from a base layer with minimal quality. ==== Fine Grain Scalability (FGS) ==== FGS enables progressive refinement of transformed coefficients within a single spatial layer. The base quality layer is encoded using the AVC standard with an initial quantization parameter (QP) ensuring minimal acceptable quality. Subsequent refinement layers reduce the QP by six, halving the quantization step. The refinement data stream can be truncated at any point, allowing fine-grained quality scalability.

Croissant (metadata format)

Croissant is a metadata format design to support sharing of datasets for machine learning applications. It is a platform-agnostic schema used to standardize metadata in data repositories like Hugging Face, kaggle, Dataverse and OpenML. == Structure == Croissant builds upon schema.org, uses primarily JSON-LD, and divides metadata in four "layers": Dataset Metadata, Resource, Structure and Semantic: The Dataset Metadata layer constrains which schema.org properties should be used, including additional properties, linking together the resources (files) of the dataset with general metadata, like licensing and citation information. The Resource layer describes the individual files and sets of those using two new classes, FileObject and FileSet. A FileSet may be a collection of related images. The Structure layer specifies how the files are organized in the dataset. A RecordSet class describes how resources are present, configurations that may very a lot between modality. This specification facilitates interoperability of the datasets. Finally, the Semantic layer adds information for practical reuse of the dataset, such as splits for train, test and validation subsets. It also provides a default extension for metadata related to responsible AI. The use of a standard machine-readable structure increases, for example, the discoverability of datasets in search engines such as Google Dataset Search. == History == Croissant was shared in arXiv in March 2024 and published in the proceedings of NeurIPS 2024. It started as community driven as a MLCommons Croissant Working Group, including stakeholders organizations from academia and industry, including Google, the open data institute, Sage Bionetworks and King's College London. Variations of Croissant are developed to support datasets in different areas of research, such as Geo-Croissant for geospatial datasets. Other technical extensions, such as support for RDF, soon followed.

1tik

1tik, pronounced Antik (Arabic: أنتيك; lit. "Everything is going well") is a fully Algerian instant messaging, social media and mobile payment app. designed, developed and built locally by the Algerian start-up, INTAJ Digital, with backing from the state-owned company ATM Mobilis (who's the company's main sponsor). It is described as Algeria's first super-app that is entirely designed and built by local developers. == Etymology == The name "1tik" (Arabic: أنتيك) is drawn from the popular Algerian vernacular (Antik), the neologism, which appeared several years ago, means "everything is going well" or "it's all good". == History == 1tik was officially launched and announced the 20th December 2025 by INTAJ Digital's founder Youcef Toulaib and a team of 50 employees, making it the first ever Algerian instant messaging, social media and mobile payment app, rivaling with the growing influence of Yassir in Algeria. it grew in popularity after the presidency of Algeria and several other state-owned companies, medias, and ministries opened official accounts on the app.