General Regionally Annotated Corpus of Ukrainian

General Regionally Annotated Corpus of Ukrainian

General Regionally Annotated Corpus of the Ukrainian Language (GRAC, Ukrainian: Генеральний регіонально анотований корпус української мови, romanized: Heneralnyi rehionalno anotovanyi korpus ukrainskoi movy, ГРАК, Ukrainian грак for rook) is a text corpus of the Ukrainian language comprising more than 2 billion tokens, intended for linguistic research in grammar, vocabulary, and the history of the Ukrainian literary language, as well as for use in compiling dictionaries and grammars. The corpus can be used for language study and also for preparing teaching materials, textbooks, learner’s dictionaries, and exercises using examples from real texts, taking into account frequency and collocational patterns, and so on. The corpus is not a model of standard Ukrainian: it may contain words and combinations that do not match current norms of the literary language. The corpus covers the period from 1816 to 2025, and as of 29 November 2025 it contains more than 812,000 texts by about 35,000 authors. == Composition of the corpus == In the 10th version of the corpus, available for searching from 20 October 2020, 35% consists of fiction. Some fiction genres are highlighted separately: children’s literature, folklore, dramatic works, and scripts. Among non-fiction texts: journalistic writing, including newspaper collections from 1888–1893, 1905, 1913–1918, 1919–1943, modern newspapers from different regions, and texts from online news/information sites; memoirs, letters, and diaries, including a sizeable corpus of Facebook texts representing blogs by people from all regions of Ukraine and the diaspora; scholarly and educational texts: monographs, dissertations, academic articles, textbooks; large subcorpora of academic literature in history, ethnography, philosophy, and law are singled out separately; religious texts, including two Ukrainian translations of the Bible; speeches and interviews. Some dictionaries that include phrasal examples and phraseology have also been incorporated, including the Ukrainian dictionary by Borys Hrinchenko and the Russian-Ukrainian idiomatic dictionary by I. Vyrhan and M. Pylynska. Using the corpus tools, these dictionaries can be searched not only for words, but also for lexico-grammatical patterns within examples and phraseological expressions. About 20% of the texts in the corpus are translations. The corpus includes translations from more than 80 languages, most of all from English and Russian. == Dating == Texts in the corpus are dated by the year of writing, or by the latest year in which a work could have been written; translated texts are dated by the year the translation was produced. A publication year may also be indicated, corresponding to the edition from which the text is taken. == Regional annotation == The corpus’s regional annotation is based on the modern administrative division of Ukraine. The corpus includes texts from all oblasts of Ukraine and from Crimea. A single text may belong to several regional subcorpora (if the author or translator was born, studied, or lived for a long time in different regions). In addition to regional subcorpora, there are subcorpora of works by authors of the Ukrainian diaspora (USA, Canada, Poland, Germany, the United Kingdom, France, etc.). These are mostly texts by emigrants of the 1940s, and to a lesser extent of 1917–1920s. == Morphological annotation == GRAC is based on the morphological analysis system nlp_uk, developed by specialists from the r2u group. The program analyzes the text and, for each word form, determines the lemma (lexeme) and tags (grammatical features). == Research based on the corpus == Research on the Ukrainian language has been carried out using the corpus, including studies of the historical dynamics of language norms, and letter and letter-combination frequencies for font development.

Data access layer

A data access layer (DAL) is a software architectural layer that provides access to data from one or more sources, such as a relational database, NoSQL database, SQL query engine, file system, or other persistent storage. It separates client code from the details of storage systems, query execution, connection handling, and data retrieval. Data access layers are commonly used to centralize data access logic, reduce coupling between applications and data sources, and provide a consistent interface for retrieving, writing, or querying data. Depending on the system, a data access layer may be implemented as application code, a shared library, an intermediary service, or part of a broader database abstraction layer. == In application architecture == In application software, a data access layer provides a boundary between business logic or application code and the systems used to store or retrieve data. For example, a data access layer may expose methods or interfaces for retrieving, writing, or querying data while hiding details such as connection management, SQL statements, storage APIs, error handling, and result conversion. Depending on the application, the layer may return objects, records, tabular results, documents, streams, or other representations of data. A common implementation is a set of classes, functions, or methods that directly reference database queries, stored procedures, storage APIs, or other data sources. For example, instead of using commands such as insert, delete, and update throughout an application to access a specific table, methods such as registerUser or loginUser may be implemented inside the data access layer. Business logic methods from an application can also be mapped to the data access layer. Instead of making several database queries directly, an application can call a single DAL method that abstracts those database calls. Applications using a data access layer may be either dependent on or independent from a particular database server. If the data access layer supports multiple database systems, the application can use any database system that the DAL can access. In either case, the data access layer provides a centralized location for calls into the underlying data store, which can make it easier to maintain, test, or port the application to other storage systems. == Implementation patterns == A data access layer can be implemented using several patterns and technologies, including data access objects, repositories, stored procedures, query builders, database drivers, or object–relational mapping tools. These mechanisms may implement part or all of a data access layer, but are not always equivalent to the layer itself. Object–relational mapping tools are commonly used in data access layers for object-oriented applications that map records in a relational database to objects in a programming language. Other data access layers may expose lower-level database interfaces, tabular results, document-oriented data, files, streams, or protocol-level interfaces. == Use with multiple underlying data systems == A data access layer may be used to abstract differences between multiple underlying data systems, allowing applications to access them through a more consistent interface. In such designs, applications call the DAL rather than interacting directly with each database or storage system. The layer may then handle connection management, query generation, result mapping, error handling, and other implementation details. A data access layer may be implemented as a shared library or as an intermediary service, such as a proxy or gateway. In this configuration, client applications or services connect to the data access layer, which then communicates with one or more underlying databases or query engines. This can provide a common location for authentication, authorization, logging, routing, and translation between different database interfaces. == Interfaces and protocols == Data access layers may expose or use standardized interfaces and protocols for database access. Examples include Open Database Connectivity (ODBC), Java Database Connectivity (JDBC), database-native wire protocols, and newer interfaces such as Apache Arrow Database Connectivity (ADBC) and Arrow Flight SQL. In systems that support multiple data stores, a data access layer may provide a consistent interface while using different drivers, protocols, or query mechanisms internally. == Distinction from related patterns == A data access layer is related to, but broader than, a data access object, which is usually an object-oriented design pattern for encapsulating access to a persistence mechanism. It is also related to a database abstraction layer, which focuses on hiding differences between database systems. In practice, the terms may overlap.

Michael Kohlhase

Michael Kohlhase (born 13 September 1964, in Erlangen) is a German computer scientist and professor at University of Erlangen–Nuremberg, where he is head of the KWARC research group (Knowledge Adaptation and Reasoning for Content). == Academic Positions == Michael Kohlhase is president of the OpenMath Society and a trustee of the Interest Group for Mathematical Knowledge Management (MKM). He was a trustee of the Conference on Automated Deduction and the CALCULEMUS Interest Group. He has been Conference Chair of CADE-21 and Program Chair of the KI-2006, MKM-2005, and CALCULEMUS-2000 conferences and has served on the Programme Committees of more than three dozen international conferences. Kohlhase holds an adjunct associate professorship at Carnegie Mellon University and was (2006–2008) vice director of the Department of Safe and Secure Cognitive Systems at German Research Centre for Artificial Intelligence (DFKI) Lab Bremen. In 2014, he became a member of the Global Digital Mathematics Library Working Group of the IMU. == Academic career == Michael Kohlhase obtained a degree in Mathematics (1989) from University of Bonn, a doctorate (1994) and habilitation (1999) in Computer Science at Saarland University. He has pursued his doctoral and post-doctoral research in extended research visits at Carnegie Mellon University, University of Amsterdam, the University of Edinburgh, and SRI International. From 2000–2003, he has conducted research and taught at the School of Computer Science at Carnegie Mellon University, where he was appointed to an adjunct associate professor. In September 2003 he was appointed as Professor of Computer Science at Jacobs University Bremen (International University Bremen until 2007), and 2006–2008 he was vice director of the Department of Safe and Secure Cognitive Systems of the German Research Centre for Artificial Intelligence (DFKI) Bremen. Since September 2016 he holds the Professorship for Knowledge Representation and Processing at University of Erlangen–Nuremberg. He has authored or edited four books and published almost 100 peer-reviewed papers. == Awards and Scholarships == 2000 3-year Heisenberg-Stipend of the Deutsche Forschungsgemeinschaft (DFG). 1996 AKI-prize, dissertation prize of the "Arbeitsgemeinschaft deutscher KI-Institute (AKI)" 1991 dissertation stipend of the Studienstiftung (German National Academic Foundation) 1986 masters stipend of Studienstiftung == Research interests == Michael Kohlhase's current research interests include Automated theorem proving and knowledge representation for mathematics, inference-based techniques for natural language processing and semantics, and computer-supported education. Much of his concrete work is based on web-based content markup formats like MathML, OpenMath, and OMDoc and systems for managing this data, e.g. semantic search engines for mathematical formulae, semantic extensions to LaTeX, or converting legacy LaTeX documents from the arXiv.

Struc2vec

struc2vec is a framework to generate node vector representations on a graph that preserve the structural identity. In contrast to node2vec, that optimizes node embeddings so that nearby nodes in the graph have similar embedding, struc2vec captures the roles of nodes in a graph, even if structurally similar nodes are far apart in the graph. It learns low-dimensional representations for nodes in a graph, generating random walks through a constructed multi-layer graph starting at each graph node. It is useful for machine learning applications where the downstream application is more related with the structural equivalence of the nodes (e.g., it can be used to detect nodes in networks with similar functions, such as interns in the social network of a corporation). struc2vec identifies nodes that play a similar role based solely on the structure of the graph, for example computing the structural identity of individuals in social networks. In particular, struc2vec employs a degree-based method to measure the pairwise structural role similarity, which is then adopted to build the multi-layer graph. Moreover, the distance between the latent representation of nodes is strongly correlated to their structural similarity. The framework contains three optimizations: reducing the length of degree sequences considered, reducing the number of pairwise similarity calculations, and reducing the number of layers in the generated graph. struc2vec follows the intuition that random walks through a graph can be treated as sentences in a corpus. Each node in a graph is treated as an individual word, and short random walk is treated as a sentence. In its final phase, the algorithm employs Gensim's word2vec algorithm to learn embeddings based on biased random walks. Sequences of nodes are fed into a skip-gram or continuous bag of words model and traditional machine-learning techniques for classification can be used. It is considered a useful framework to learn node embeddings based on structural equivalence.

Suffix automaton

In computer science, a suffix automaton is an efficient data structure for representing the substring index of a given string which allows the storage, processing, and retrieval of compressed information about all its substrings. The suffix automaton of a string S {\displaystyle S} is the smallest directed acyclic graph with a dedicated initial vertex and a set of "final" vertices, such that paths from the initial vertex to final vertices represent the suffixes of the string. In terms of automata theory, a suffix automaton is the minimal partial deterministic finite automaton that recognizes the set of suffixes of a given string S = s 1 s 2 … s n {\displaystyle S=s_{1}s_{2}\dots s_{n}} . The state graph of a suffix automaton is called a directed acyclic word graph (DAWG), a term that is also sometimes used for any deterministic acyclic finite state automaton. Suffix automata were introduced in 1983 by a group of scientists from the University of Denver and the University of Colorado Boulder. They suggested a linear time online algorithm for its construction and showed that the suffix automaton of a string S {\displaystyle S} having length at least two characters has at most 2 | S | − 1 {\textstyle 2|S|-1} states and at most 3 | S | − 4 {\textstyle 3|S|-4} transitions. Further works have shown a close connection between suffix automata and suffix trees, and have outlined several generalizations of suffix automata, such as compacted suffix automaton obtained by compression of nodes with a single outgoing arc. Suffix automata provide efficient solutions to problems such as substring search and computation of the largest common substring of two and more strings. == History == The concept of suffix automaton was introduced in 1983 by a group of scientists from University of Denver and University of Colorado Boulder consisting of Anselm Blumer, Janet Blumer, Andrzej Ehrenfeucht, David Haussler and Ross McConnell, although similar concepts had earlier been studied alongside suffix trees in the works of Peter Weiner, Vaughan Pratt and Anatol Slissenko. In their initial work, Blumer et al. showed a suffix automaton built for the string S {\displaystyle S} of length greater than 1 {\displaystyle 1} has at most 2 | S | − 1 {\displaystyle 2|S|-1} states and at most 3 | S | − 4 {\displaystyle 3|S|-4} transitions, and suggested a linear algorithm for automaton construction. In 1983, Mu-Tian Chen and Joel Seiferas independently showed that Weiner's 1973 suffix-tree construction algorithm while building a suffix tree of the string S {\displaystyle S} constructs a suffix automaton of the reversed string S R {\textstyle S^{R}} as an auxiliary structure. In 1987, Blumer et al. applied the compressing technique used in suffix trees to a suffix automaton and invented the compacted suffix automaton, which is also called the compacted directed acyclic word graph (CDAWG). In 1997, Maxime Crochemore and Renaud Vérin developed a linear algorithm for direct CDAWG construction. In 2001, Shunsuke Inenaga et al. developed an algorithm for construction of CDAWG for a set of words given by a trie. == Definitions == Usually when speaking about suffix automata and related concepts, some notions from formal language theory and automata theory are used, in particular: "Alphabet" is a finite set Σ {\displaystyle \Sigma } that is used to construct words. Its elements are called "characters"; "Word" is a finite sequence of characters ω = ω 1 ω 2 … ω n {\displaystyle \omega =\omega _{1}\omega _{2}\dots \omega _{n}} . "Length" of the word ω {\displaystyle \omega } is denoted as | ω | = n {\displaystyle |\omega |=n} ; "Formal language" is a set of words over given alphabet; "Language of all words" is denoted as Σ ∗ {\displaystyle \Sigma ^{}} (where the "" character stands for Kleene star), "empty word" (the word of zero length) is denoted by the character ε {\displaystyle \varepsilon } ; "Concatenation of words" α = α 1 α 2 … α n {\displaystyle \alpha =\alpha _{1}\alpha _{2}\dots \alpha _{n}} and β = β 1 β 2 … β m {\displaystyle \beta =\beta _{1}\beta _{2}\dots \beta _{m}} is denoted as α ⋅ β {\displaystyle \alpha \cdot \beta } or α β {\displaystyle \alpha \beta } and corresponds to the word obtained by writing β {\displaystyle \beta } to the right of α {\displaystyle \alpha } , that is, α β = α 1 α 2 … α n β 1 β 2 … β m {\displaystyle \alpha \beta =\alpha _{1}\alpha _{2}\dots \alpha _{n}\beta _{1}\beta _{2}\dots \beta _{m}} ; "Concatenation of languages" A {\displaystyle A} and B {\displaystyle B} is denoted as A ⋅ B {\displaystyle A\cdot B} or A B {\displaystyle AB} and corresponds to the set of pairwise concatenations A B = { α β : α ∈ A , β ∈ B } {\displaystyle AB=\{\alpha \beta :\alpha \in A,\beta \in B\}} ; If the word ω ∈ Σ ∗ {\displaystyle \omega \in \Sigma ^{}} may be represented as ω = α γ β {\displaystyle \omega =\alpha \gamma \beta } , where α , β , γ ∈ Σ ∗ {\displaystyle \alpha ,\beta ,\gamma \in \Sigma ^{}} , then words α {\displaystyle \alpha } , β {\displaystyle \beta } and γ {\displaystyle \gamma } are called "prefix", "suffix" and "subword" (substring) of the word ω {\displaystyle \omega } correspondingly; If T = T 1 … T n {\displaystyle T=T_{1}\dots T_{n}} and T l T l + 1 … T r = S {\displaystyle T_{l}T_{l+1}\dots T_{r}=S} (with 1 ≤ l ≤ r ≤ n {\displaystyle 1\leq l\leq r\leq n} ) then S {\displaystyle S} is said to "occur" in T {\displaystyle T} as a subword. Here l {\displaystyle l} and r {\displaystyle r} are called left and right positions of occurrence of S {\displaystyle S} in T {\displaystyle T} correspondingly. == Automaton structure == Formally, deterministic finite automaton is determined by 5-tuple A = ( Σ , Q , q 0 , F , δ ) {\displaystyle {\mathcal {A}}=(\Sigma ,Q,q_{0},F,\delta )} , where: Σ {\displaystyle \Sigma } is an "alphabet" that is used to construct words, Q {\displaystyle Q} is a set of automaton "states", q 0 ∈ Q {\displaystyle q_{0}\in Q} is an "initial" state of automaton, F ⊂ Q {\displaystyle F\subset Q} is a set of "final" states of automaton, δ : Q × Σ ↦ Q {\displaystyle \delta :Q\times \Sigma \mapsto Q} is a partial "transition" function of automaton, such that δ ( q , σ ) {\displaystyle \delta (q,\sigma )} for q ∈ Q {\displaystyle q\in Q} and σ ∈ Σ {\displaystyle \sigma \in \Sigma } is either undefined or defines a transition from q {\displaystyle q} over character σ {\displaystyle \sigma } . Most commonly, deterministic finite automaton is represented as a directed graph ("diagram") such that: Set of graph vertices corresponds to the state of states Q {\displaystyle Q} , Graph has a specific marked vertex corresponding to initial state q 0 {\displaystyle q_{0}} , Graph has several marked vertices corresponding to the set of final states F {\displaystyle F} , Set of graph arcs corresponds to the set of transitions δ {\displaystyle \delta } , Specifically, every transition δ ( q 1 , σ ) = q 2 {\textstyle \delta (q_{1},\sigma )=q_{2}} is represented by an arc from q 1 {\displaystyle q_{1}} to q 2 {\displaystyle q_{2}} marked with the character σ {\displaystyle \sigma } . This transition also may be denoted as q 1 σ ⟶ q 2 {\textstyle q_{1}{\begin{smallmatrix}{\sigma }\\[-5pt]{\longrightarrow }\end{smallmatrix}}q_{2}} . In terms of its diagram, the automaton recognizes the word ω = ω 1 ω 2 … ω m {\displaystyle \omega =\omega _{1}\omega _{2}\dots \omega _{m}} only if there is a path from the initial vertex q 0 {\displaystyle q_{0}} to some final vertex q ∈ F {\displaystyle q\in F} such that concatenation of characters on this path forms ω {\displaystyle \omega } . The set of words recognized by an automaton forms a language that is set to be recognized by the automaton. In these terms, the language recognized by a suffix automaton of S {\displaystyle S} is the language of its (possibly empty) suffixes. === Automaton states === "Right context" of the word ω {\displaystyle \omega } with respect to language L {\displaystyle L} is a set [ ω ] R = { α : ω α ∈ L } {\displaystyle [\omega ]_{R}=\{\alpha :\omega \alpha \in L\}} that is a set of words α {\displaystyle \alpha } such that their concatenation with ω {\displaystyle \omega } forms a word from L {\displaystyle L} . Right contexts induce a natural equivalence relation [ α ] R = [ β ] R {\displaystyle [\alpha ]_{R}=[\beta ]_{R}} on the set of all words. If language L {\displaystyle L} is recognized by some deterministic finite automaton, there exists unique up to isomorphism automaton that recognizes the same language and has the minimum possible number of states. Such an automaton is called a minimal automaton for the given language L {\displaystyle L} . Myhill–Nerode theorem allows it to define it explicitly in terms of right contexts: In these terms, a "suffix automaton" is the minimal deterministic finite automaton recognizing the language of suffixes of the word S = s 1 s 2 … s n {\displaystyle S=s_{1}s_{2}\dots s_{n}} . The right context of the word ω {\displaystyle \omeg

Texture artist

A texture artist is an individual who develops textures for digital media, usually for video games, movies, web sites and television shows or things like 3D posters. These textures can be in the form of 2D or (rarely) 3D art that may be overlaid onto a polygon mesh to create a realistic 3D model. Texture artists often take advantage of web sites for the purposes of marketing their art and self-promotion of their skills with the goal of gaining employment from a professional game studio or to join a team working on a "mod" (modification) of an existing game in hopes of establishing industry or trade credentials.

HFST

Helsinki Finite-State Technology (HFST) is a computer programming library and set of utilities for natural language processing with finite-state automata and finite-state transducers. It is free and open-source software, released under a mix of the GNU General Public License version 3 (GPLv3) and the Apache License. == Features == The library functions as an interchanging interface to multiple backends, such as OpenFST, foma and SFST. The utilities comprise various compilers, such as hfst-twolc (a compiler for morphological two-level rules), hfst-lexc (a compiler for lexicon definitions) and hfst-regexp2fst (a regular expression compiler). Functions from Xerox's proprietary scripting language xfst is duplicated in hfst-xfst, and the pattern matching utility pmatch in hfst-pmatch, which goes beyond the finite-state formalism in having recursive transition networks (RTNs). The library and utilities are written in C++, with an interface to the library in Python and a utility for looking up results from transducers ported to Java and Python. Transducers in HFST may incorporate weights depending on the backend. For performing FST operations, this is currently only possible via the OpenFST backend. HFST provides two native backends, one designed for fast lookup (hfst-optimized-lookup), the other for format interchange. Both of them can be weighted. == Uses == HFST has been used for writing various linguistic tools, such as spell-checkers, hyphenators, and morphologies. Morphological dictionaries written in other formalisms have also been converted to HFST's formats.