AI Data Center Financing Surge

AI Data Center Financing Surge — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Catie Cuan

    Catie Cuan is an artist, entrepreneur, and innovator in the field of robotic art and human-robot interaction, where she specializes in choreorobotics, an emerging field at the intersection of choreographic dance and robotics. Cuan is one of the academic researchers pioneering the field of choreorobotics and currently holds a post-doctoral fellowship at Stanford University. == Career == Catie Cuan earned a bachelor's degree from the University of California, Berkeley. She graduated with a Ph.D. from the Department of Mechanical Engineering at Stanford University, focusing on robotics. Her most cited publication is about how to improve robotic expressive systems using tools from dance theory, such as the Laban/Bartenieff Movement Analysis. In her most recent research projects, she explores a predictive model of imitation learning for robots moving around humans, a project that advances the field of social robotics. Cuan credits her work in robotics to her experience when her father had a stroke and was surrounded by many medical machines, which made her think about how people might feel empowered and hopeful rather than afraid. As a ballet dancer and choreographer, she has performed with the Metropolitan Opera Ballet and the Lyric Opera of Chicago. In 2020, she was the dancer and choreographer of the show Output, which was part of a collaboration with ThoughtWorks Arts and the Pratt Institute. In the production, she danced with an ABB IRB 6700 industrial robot. In 2022, she was named as an IF/THEN ambassador for the American Association for the Advancement of Science. The same year, she was appointed Futurist-in-Residence at the Smithsonian Arts and Industries Building, where she performed at the closing ceremonies of the FUTURES exhibit on July 6, 2022. 
Cuan has also contributed to product designs, working with IDEO and Dutch interior design firm moooi on their Piro project, which launched a dancing scent diffuser robot during Milan Design Week in June 2022. She is a TED speaker with talks about how to teach robots to dance, and what is coming up for dancing robots in the AI era.

  • Master data management

    Master data management (MDM) is a discipline in which business and information technology collaborate to ensure the uniformity, accuracy, stewardship, semantic consistency, and accountability of the enterprise's official shared master data assets. == Reasons for master data management == Data consistency and accuracy: MDM ensures that the organization's critical data is consistent and accurate across all systems, reducing discrepancies and errors caused by multiple, siloed copies of the same data. Improved decision-making: By providing a single version of the truth (SVOT), MDM enables organizations to deliver the right data to decision makers, allowing them to clearly understand business performance and make informed, data-driven decisions. Operational efficiency: With the consistent and accurate data provided by an MDM, operational processes such as reporting and inventory management can be automated to improve efficiency. Employee learning, onboarding, and customer service also become more efficient, as MDM data facilitates rapid, accurate, and thorough information retrieval, permitting more employee time to be spent on work. Regulatory compliance: MDM tries to help organizations comply with industry standards and regulations by ensuring that master data is accurately recorded, maintained, and audited. However, issues with data quality, classification, and reconciliation may require data transformation. As with other Extract, Transform, Load-based data movements, these processes are expensive and inefficient, reducing return on investment for a project. == Business unit and product line segmentation == As a result of business unit and product line segmentation, the same entity (whether a customer, supplier, or product) will be included in different product lines. This leads to data redundancy and even confusion. For example, a customer takes out a mortgage at a bank. 
If the marketing and customer service departments have separate databases, advertisements might still be sent to the customer, even though they've already signed up. The two parts of the bank are unaware, and the customer is sent irrelevant communications. Record linkage can associate different records corresponding to the same entity, mitigating this issue. == Mergers and acquisitions == One of the most common problems for master data management is company growth through mergers or acquisitions. Reconciling these separate master data systems can present difficulties, as existing applications have dependencies on the master databases. Ideally, database administrators resolve this problem through deduplication of the master data as part of the merger. Over time, as further mergers and acquisitions occur, the problem can multiply. Data reconciliation processes can become extremely complex or even unreliable. Some organizations end up with 10, 15, or even 100 separate and poorly integrated master databases. This can cause serious problems in customer satisfaction, operational efficiency, decision support, and regulatory compliance. Another problem involves determining the proper degrees of detail and normalization to include in the master data schema. For example, in a federated Human Resources environment, the enterprise software may focus on storing people's data as current status, adding a few fields to identify the date of hire, date of last promotion, etc. However, this simplification can introduce business-impacting errors into dependent systems for planning and forecasting. The stakeholders of such systems may be forced to build a parallel network of new interfaces to track the onboarding of new hires, planned retirements, and divestment, which works against one of the aims of master data management. == People, processes and technology == Master data management is enabled by technology, but is more than the technologies that enable it. 
An organization's master data management capability will also include people and processes in its definition. === People === Several roles should be staffed within MDM, most prominently the Data Owner and the Data Steward. Several people would likely be allocated to each role, with each person responsible for a subset of master data (e.g. one data owner for employee master data, another for customer master data). The Data Owner is responsible for the requirements for data definition, data quality, data security, etc., as well as for compliance with data governance and data management procedures. The Data Owner should also fund improvement projects in case of deviations from the requirements. The Data Steward runs the master data management on behalf of the Data Owner and probably also acts as an advisor to the Data Owner. === Processes === Master data management can be viewed as a "discipline for specialized quality improvement" defined by the policies and procedures put in place by a data governance organization. It has the objective of providing processes for collecting, aggregating, matching, consolidating, quality-assuring, persisting and distributing master data throughout an organization to ensure a common understanding, consistency, accuracy and control, in the ongoing maintenance and application use of that data. Processes commonly seen in master data management include source identification, data collection, data transformation, normalization, rule administration, error detection and correction, data consolidation, data storage, data distribution, data classification, taxonomy services, item master creation, schema mapping, product codification, data enrichment, hierarchy management, business semantics management and data governance. 
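As a rough illustration of the matching and consolidation processes listed above, the following Python sketch groups source records by a normalized match key and merges each group into one "golden record". The field names (`email`, `name`, `phone`, `updated`), the match rule (case-insensitive email), and the survivorship rule (the newest non-empty value wins) are assumptions made for this example, not features of any particular MDM product.

```python
from collections import defaultdict


def match_key(record):
    """Hypothetical match rule: same trimmed, lower-cased email means same entity."""
    return record["email"].strip().lower()


def consolidate(records):
    """Merge matched records; for each field the newest non-empty value survives."""
    groups = defaultdict(list)
    for record in records:
        groups[match_key(record)].append(record)

    golden = {}
    for key, group in groups.items():
        merged = {}
        # Oldest first, so newer non-empty values overwrite older ones.
        for record in sorted(group, key=lambda r: r["updated"]):
            for field, value in record.items():
                if value not in ("", None):
                    merged[field] = value
        golden[key] = merged
    return golden


# Three source records; the first two describe the same customer.
crm = {"email": "Ana@Example.com ", "name": "Ana Diaz", "phone": "", "updated": "2023-01-10"}
billing = {"email": "ana@example.com", "name": "A. Diaz", "phone": "555-0100", "updated": "2024-03-02"}
other = {"email": "bo@example.com", "name": "Bo Chen", "phone": "", "updated": "2022-07-01"}

golden_records = consolidate([crm, billing, other])
for key, record in golden_records.items():
    print(key, record)
```

Real matching usually needs fuzzier rules than an exact key, and survivorship rules are typically set per attribute by the Data Owner; the sketch only shows the overall shape of the process.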
=== Technology === A master data management tool can be used to support master data management by removing duplicates, standardizing data (mass maintaining), and incorporating rules to eliminate incorrect data from entering the system to create an authoritative source of master data. Master data are the products, accounts, and parties for which the business transactions are completed. Where the technology approach produces a "golden record" or relies on a "source of record" or "system of record", it is common to talk of where the data is "mastered". This is accepted terminology in the information technology industry, but care should be taken, both with specialists and with the wider stakeholder community, to avoid confusing the concept of "master data" with that of "mastering data". ==== Implementation models ==== There are several models for implementing a technology solution for master data management. These depend on an organization's core business, its corporate structure, and its goals. These include: Source of record Registry Consolidation Coexistence Transaction/centralized ===== Source of record ===== This model identifies a single application, database, or simpler source (e.g. a spreadsheet) as being the "source of record" (or "system of record" where solely application databases are relied on). The benefit of this model is its conceptual simplicity, but it may not fit with the realities of complex master data distribution in large organizations. The source of record can be federated, for example by groups of attributes (so that different attributes of a master data entity may have different sources of record) or geographically (so that different parts of an organization may have different master sources). Federation is only applicable in certain use cases, where there is a clear delineation of which subsets of records will be found in which sources. 
The source of record model can be applied more widely than simply to master data, for example to reference data. ==== Transmission of master data ==== There are several ways in which master data may be collated and distributed to other systems. These include: Data consolidation – The process of capturing master data from multiple sources and integrating it into a single hub (operational data store) for replication to other destination systems. Data federation – The process of providing a single virtual view of master data from one or more sources to one or more destination systems. Data propagation – The process of copying master data from one system to another, typically through point-to-point interfaces in legacy systems. == Change management in implementation == Challenges in adopting master data management within large organizations often arise when the "single version of the truth" concept is not affirmed by stakeholders, who believe that their local definition of the master data is necessary. For example, the product hierarchy used to manage inventory may be entirely different from the product hierarchies used to support marketing efforts or pay sales representatives. It is above all necessary to identify if different master data is genuinely required. If it is required, then the solution implemented (technology and process) must be able to allow multiple versions of the truth to exist but will prov

  • TurboQuant

    TurboQuant is an online vector quantization algorithm for compressing high-dimensional Euclidean vectors while preserving their geometric structure. It was proposed in 2025 by Amir Zandieh, Majid Daliri, Majid Hadian, and Vahab Mirrokni in the paper TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate. The paper lists Zandieh and Mirrokni as affiliated with Google Research, Daliri with New York University, and Hadian with Google DeepMind. The method was developed for applications including large language model (LLM) inference, key–value (KV) cache compression, vector databases, and nearest neighbor search. TurboQuant consists of two related algorithms: TurboQuantmse, which is optimized for mean squared error (MSE), and TurboQuantprod, which is optimized for unbiased inner product estimation. The algorithm uses a random rotation of input vectors, applies scalar quantizers to the rotated coordinates, and, for inner-product estimation, applies a one-bit Quantized Johnson–Lindenstrauss (QJL) transform to the residual error. == Background == Vector quantization is a compression method that maps high-dimensional vectors to a finite set of codewords. The problem has roots in Shannon's source coding theory and rate–distortion theory. In machine learning and information retrieval, vector quantization is used to reduce the memory required to store embeddings, activation vectors, and other numerical representations. In Transformer-based large language models, the KV cache stores key and value vectors from previous tokens during autoregressive decoding. The size of this cache grows with context length, the number of attention heads, and the number of concurrent requests, making it a major memory bottleneck in LLM serving. Similar compression problems appear in vector search, where large collections of embedding vectors must be stored and searched efficiently. 
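To make the scaling concrete, here is a back-of-the-envelope sketch of KV cache size in Python. It uses the usual accounting for a standard Transformer decoder: one key and one value vector per layer, per KV head, per token. The model dimensions are illustrative assumptions, not figures from the TurboQuant paper.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bits_per_value, requests=1):
    """Approximate KV cache size: keys and values for every layer, head, and token."""
    values = 2 * layers * kv_heads * head_dim * tokens * requests  # 2 = key + value
    return values * bits_per_value / 8


# Assumed dimensions for illustration only: 32 layers, 8 KV heads of width 128.
full = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, tokens=104_000, bits_per_value=16)
compressed = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, tokens=104_000, bits_per_value=3.5)

print(f"16-bit cache:  {full / 2**30:.2f} GiB")
print(f"3.5-bit cache: {compressed / 2**30:.2f} GiB")
print(f"compression:   {full / compressed:.2f}x")
```

The estimate ignores any per-vector metadata a quantizer may store (for example norms), so real savings would be somewhat smaller than the raw bit-width ratio.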
Earlier approaches to vector quantization include product quantization, scalar quantization, and data-dependent k-means codebook construction. The TurboQuant paper argues that many existing methods either require offline preprocessing and calibration or suffer from suboptimal distortion guarantees in online settings. == Algorithm == === TurboQuantmse === TurboQuantmse is the version of the algorithm optimized for mean-squared error. For a unit vector $x \in S^{d-1}$, the algorithm first applies a random rotation matrix $\Pi \in \mathbb{R}^{d \times d}$ and sets $z = \Pi x$. Each coordinate of the rotated vector follows a shifted and scaled beta distribution, which converges to a normal distribution in high dimensions. In high dimensions, distinct coordinates also become nearly independent, allowing the algorithm to apply scalar quantizers independently to each coordinate. The scalar quantizer is constructed by solving a one-dimensional continuous k-means or Lloyd–Max quantization problem. If the centroids are $c_1, c_2, \ldots, c_{2^b}$, the quantization step stores, for each coordinate, $\mathrm{idx}_j = \operatorname{arg\,min}_{k \in [2^b]} |z_j - c_k|$. During dequantization, the stored index for each coordinate is replaced by the corresponding centroid, giving a reconstructed rotated vector $\tilde{z}$. The algorithm then rotates back: $\tilde{x} = \Pi^{\top} \tilde{z}$. The paper gives the following bound for TurboQuantmse: $D_{\mathrm{mse}} \leq \frac{\sqrt{3\pi}}{2} \cdot \frac{1}{4^b}$. It also reports finer-grained MSE values of approximately 0.36, 0.117, 0.03, and 0.009 for bit-widths $b = 1, 2, 3, 4$, respectively. 
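The rotate, quantize, and rotate-back steps described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the rotation is built by Gram-Schmidt orthogonalization of Gaussian vectors, only the one-bit case is shown, and the rotated coordinates are treated as Gaussian with variance 1/d, for which the two Lloyd-Max centroids are ±sqrt(2/(πd)). All function names are hypothetical.

```python
import math
import random


def random_rotation(d, rng):
    """Random orthogonal d x d matrix (list of rows) via modified Gram-Schmidt."""
    rows = []
    for _ in range(d):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for q in rows:
            proj = sum(a * b for a, b in zip(v, q))
            v = [a - proj * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        rows.append([a / norm for a in v])
    return rows


def quantize(x, rotation, centroids):
    """Rotate x, then store the index of the nearest centroid for each coordinate."""
    z = [sum(a * b for a, b in zip(row, x)) for row in rotation]
    return [min(range(len(centroids)), key=lambda k: abs(zj - centroids[k])) for zj in z]


def dequantize(indices, rotation, centroids):
    """Replace indices by centroids, then rotate back with the transpose."""
    z_tilde = [centroids[k] for k in indices]
    x_tilde = [0.0] * len(z_tilde)
    for row, zj in zip(rotation, z_tilde):
        for i, a in enumerate(row):
            x_tilde[i] += a * zj
    return x_tilde


def average_mse(d=64, trials=20, seed=0):
    """Average squared reconstruction error over random unit vectors, b = 1."""
    rng = random.Random(seed)
    rotation = random_rotation(d, rng)
    # Assumption: coordinates ~ N(0, 1/d), so the 1-bit centroids are +/- sqrt(2/(pi*d)).
    c = math.sqrt(2.0 / (math.pi * d))
    centroids = [-c, c]
    total = 0.0
    for _ in range(trials):
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(a * a for a in g))
        x = [a / norm for a in g]
        x_tilde = dequantize(quantize(x, rotation, centroids), rotation, centroids)
        total += sum((a - b) ** 2 for a, b in zip(x, x_tilde))
    return total / trials


mse = average_mse()
print(f"average squared error at b = 1: {mse:.3f}")
```

Under the Gaussian approximation the expected error is 1 − 2/π ≈ 0.363, so the printed value should fall in the neighborhood of the roughly 0.36 reported for b = 1. Larger bit-widths would need centroids from a Lloyd-Max solver, which is omitted here.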
=== TurboQuantprod === TurboQuantprod is optimized for unbiased inner-product estimation. The authors note that an MSE-optimized quantizer may introduce bias when used to estimate inner products. To address this, TurboQuantprod first applies TurboQuantmse with bit-width $b - 1$, then applies a one-bit Quantized Johnson–Lindenstrauss transform to the remaining residual vector. Let $r = x - Q_{\mathrm{mse}}^{-1}(Q_{\mathrm{mse}}(x))$ be the residual after MSE quantization, and let $\gamma = \|r\|_2$. The QJL step stores a sign vector for the residual. For $\gamma \neq 0$, this can be written using the normalized residual $u = r/\gamma$: $qjl = \operatorname{sign}(Su)$, where $S \in \mathbb{R}^{d \times d}$ is a random projection matrix. Since the sign function is invariant under positive rescaling, this is equivalent to $\operatorname{sign}(Sr)$ when $r \neq 0$. If $\gamma = 0$, the residual correction is zero. TurboQuantprod stores the MSE quantization, the QJL sign vector, and the residual norm: $Q_{\mathrm{prod}}(x) = \left[Q_{\mathrm{mse}}(x), qjl, \gamma\right]$. The dequantized vector is reconstructed as $\tilde{x} = \tilde{x}_{\mathrm{mse}} + \frac{\sqrt{\pi/2}}{d}\,\gamma S^{\top} qjl$. The paper proves that TurboQuantprod is unbiased for inner-product estimation: $\mathbb{E}_{\tilde{x}}\left[\langle y, \tilde{x} \rangle\right] = \langle y, x \rangle$. It also gives the distortion bound $D_{\mathrm{prod}} \leq \frac{\sqrt{3\pi}}{2} \cdot \frac{\|y\|_2^2}{d} \cdot \frac{1}{4^b}$. == Performance and applications == The TurboQuant paper reports that the algorithm achieves near-optimal distortion rates within a small constant factor of information-theoretic lower bounds. The authors report that, for KV cache quantization, TurboQuant achieved quality neutrality at 3.5 bits per channel and marginal degradation at 2.5 bits per channel. In long-context LLM experiments using Llama 3.1 8B Instruct, the paper evaluated the method on a "needle-in-a-haystack" retrieval task with document lengths from 4,000 to 104,000 tokens. It reported that TurboQuant matched the uncompressed full-precision baseline while using more than 4× compression, and compared the method against PolarQuant, SnapKV, PyramidKV, and KIVI. Google Research stated that TurboQuant was evaluated on long-context benchmarks including LongBench, Needle in a Haystack, ZeroSCROLLS, RULER, and L-Eval using open-source models including Gemma and Mistral. According to a report in Tom's Hardware, Google described the method as reducing KV-cache memory by at least six times and achieving up to an eightfold improvement in attention-logit computation on Nvidia H100 GPUs compared with unquantized 32-bit keys. TurboQuant has also been applied to nearest-neighbor vector search. The original paper reports experiments on DBpedia entity embeddings and GloVe embeddings, comparing TurboQuant with product quantization and other vector-search quantization baselines. 
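The one-bit QJL residual correction from the TurboQuantprod section above can be checked numerically. The sketch below applies the sign-and-rescale step directly to a residual vector and averages the resulting inner-product estimate over many random projections. It assumes that S has independent standard normal entries; the text above only calls S a random projection matrix. The function names are hypothetical.

```python
import math
import random


def qjl_encode(r, S):
    """Keep only the signs of S r, plus the residual norm gamma."""
    gamma = math.sqrt(sum(a * a for a in r))
    signs = [1.0 if sum(a * b for a, b in zip(row, r)) >= 0.0 else -1.0 for row in S]
    return signs, gamma


def qjl_decode(signs, gamma, S):
    """Reconstruct sqrt(pi/2)/d * gamma * S^T signs (zero when gamma is zero)."""
    d = len(S)
    scale = math.sqrt(math.pi / 2.0) / d * gamma
    out = [0.0] * d
    for row, s in zip(S, signs):
        for j, a in enumerate(row):
            out[j] += scale * s * a
    return out


def mean_inner_product(r, y, trials=4000, seed=0):
    """Average <y, r_hat> over independent draws of S (assumed i.i.d. N(0, 1))."""
    rng = random.Random(seed)
    d = len(r)
    total = 0.0
    for _ in range(trials):
        S = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(d)]
        signs, gamma = qjl_encode(r, S)
        r_hat = qjl_decode(signs, gamma, S)
        total += sum(a * b for a, b in zip(y, r_hat))
    return total / trials


r = [0.3, -0.2, 0.1, 0.4, 0.0, -0.1, 0.2, -0.3]
y = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
estimate = mean_inner_product(r, y)
exact = sum(a * b for a, b in zip(r, y))
print(f"exact <y, r> = {exact:.3f}, averaged estimate = {estimate:.3f}")
```

A single draw of S gives a noisy estimate, but the average should approach the exact inner product, which is the unbiasedness property stated above. In the full method this correction is added to the MSE reconstruction rather than used alone.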
== Relationship to other methods == TurboQuant is related to several methods for efficient large language model inference and high-dimensional search: Product quantization – a vector quantization technique widely used for approximate nearest-neighbor search Quantization (machine learning) – reducing the numerical precision of weights, activations, or cached tensors in machine learning models PagedAttention – a memory-management algorithm for LLM serving that reduces fragmentation in the KV cache Johnson–Lindenstrauss lemma – a result in high-dimensional geometry used in random projection methods Lloyd's algorithm – an algorithm for scalar and vector quantization, including k-means-style codebook construction Unlike PagedAttention, which focuses on memory allocation and cache layout, TurboQuant reduces the numerical storage cost of the vectors themselves. Unlike many product-quantization methods, TurboQuant is designed to be data-oblivious and online, avoiding dataset-specific codebook training. == Limitations == The strongest performance claims for TurboQuant come from the original paper and Google Research's own publication. Coverage in technology media has noted that the broader impact of the method will depend on real-world implementation details, workloads, and hardware architectures.

  • Linguistic categories

    Linguistic categories include Lexical category, a part of speech such as noun, preposition, etc. Syntactic category, a similar concept which can also include phrasal categories Grammatical category, a grammatical feature such as tense, gender, etc. The definition of linguistic categories is a major concern of linguistic theory, and thus, the definition and naming of categories varies across different theoretical frameworks and grammatical traditions for different languages. The operationalization of linguistic categories in lexicography, computational linguistics, natural language processing, corpus linguistics, and terminology management typically requires resource-, problem- or application-specific definitions of linguistic categories. In Cognitive linguistics it has been argued that linguistic categories have a prototype structure like that of the categories of common words in a language. == Linguistic category inventories == To facilitate the interoperability between lexical resources, linguistic annotations and annotation tools and for the systematic handling of linguistic categories across different theoretical frameworks, a number of inventories of linguistic categories have been developed and are being used, with examples as given below. The practical objective of such inventories is to perform quantitative evaluation (for language-specific inventories), to train NLP tools, or to facilitate cross-linguistic evaluation, querying or annotation of language data. At a theoretical level, the existence of universal categories in human language has been postulated, e.g., in Universal grammar, but also heavily criticized. === Part-of-Speech tagsets === Schools commonly teach that there are 9 parts of speech in English: noun, verb, article, adjective, preposition, pronoun, adverb, conjunction, and interjection. However, there are clearly many more categories and sub-categories. For nouns, the plural, possessive, and singular forms can be distinguished. 
In many languages words are also marked for their case (role as subject, object, etc.), grammatical gender, and so on; while verbs are marked for tense, aspect, and other things. In some tagging systems, different inflections of the same root word will get different parts of speech, resulting in a large number of tags. For example, NN for singular common nouns, NNS for plural common nouns, NP for singular proper nouns (see the POS tags used in the Brown Corpus). Other tagging systems use a smaller number of tags and ignore fine differences or model them as features somewhat independent from part-of-speech. In part-of-speech tagging by computer, it is typical to distinguish from 50 to 150 separate parts of speech for English. POS tagging work has been done in a variety of languages, and the set of POS tags used varies greatly with language. Tags usually are designed to include overt morphological distinctions, although this leads to inconsistencies such as case-marking for pronouns but not nouns in English, and much larger cross-language differences. The tag sets for heavily inflected languages such as Greek and Latin can be very large; tagging words in agglutinative languages such as Inuit languages may be virtually impossible. Work on stochastic methods for tagging Koine Greek (DeRose 1990) has used over 1,000 parts of speech and found that about as many words were ambiguous in that language as in English. A morphosyntactic descriptor in the case of morphologically rich languages is commonly expressed using very short mnemonics, such as ncmsan for category = noun, type = common, gender = masculine, number = singular, case = accusative, animate = no. The most popular tag set for POS tagging for American English is probably the Penn tag set, developed in the Penn Treebank project. 
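A positional morphosyntactic descriptor such as the ncmsan example above can be decoded with a small lookup table per position. The table below is a hypothetical fragment covering only the codes in that example; real tag sets define many more values per position, and the layout varies by language and category.

```python
# Hypothetical positional scheme covering only the example tag from the text.
POSITIONS = [
    ("category", {"n": "noun"}),
    ("type", {"c": "common"}),
    ("gender", {"m": "masculine"}),
    ("number", {"s": "singular"}),
    ("case", {"a": "accusative"}),
    ("animate", {"n": "no"}),
]


def decode_msd(tag):
    """Map each character of a positional tag to its (feature, value) pair."""
    if len(tag) != len(POSITIONS):
        raise ValueError(f"expected {len(POSITIONS)} characters, got {len(tag)}")
    features = {}
    for char, (feature, values) in zip(tag, POSITIONS):
        if char not in values:
            raise ValueError(f"unknown code {char!r} for feature {feature!r}")
        features[feature] = values[char]
    return features


print(decode_msd("ncmsan"))
```

The same letter can mean different things in different positions (here n is "noun" in the first slot and "no" in the last), which is why such descriptors are read by position and not by character.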
=== Multilingual annotation schemes === For Western European languages, cross-linguistically applicable annotation schemes for parts-of-speech, morphosyntax and syntax have been developed with the EAGLES Guidelines. The "Expert Advisory Group on Language Engineering Standards" (EAGLES) was an initiative of the European Commission that ran within the DG XIII Linguistic Research and Engineering programme from 1994 to 1998, coordinated by Consorzio Pisa Ricerche, Pisa, Italy. The EAGLES guidelines provide guidance for markup to be used with text corpora, particularly for identifying features relevant in computational linguistics and lexicography. Numerous companies, research centres, universities and professional bodies across the European Union collaborated to produce the EAGLES Guidelines, which set out recommendations for de facto standards and rules of best practice for: Large-scale language resources (such as text corpora, computational lexicons and speech corpora); Means of manipulating such knowledge, via computational linguistic formalisms, mark up languages and various software tools; Means of assessing and evaluating resources, tools and products. The Eagles guidelines have inspired subsequent work on other regions, as well, e.g., Eastern Europe. A generation later, a similar effort was initiated by the research community under the umbrella of Universal Dependencies. Petrov et al. have proposed a "universal", but highly reductionist, tag set, with 12 categories (for example, no subtypes of nouns, verbs, punctuation, etc.; no distinction of "to" as an infinitive marker vs. preposition (hardly a "universal" coincidence), etc.). 
Subsequently, this was complemented with cross-lingual specifications for dependency syntax (Stanford Dependencies), and morphosyntax (Interset interlingua, partially building on the Multext-East/Eagles tradition) in the context of the Universal Dependencies (UD), an international cooperative project to create treebanks of the world's languages with cross-linguistically applicable ("universal") annotations for parts of speech, dependency syntax, and (optionally) morphosyntactic (morphological) features. Core applications are automated text processing in the field of natural language processing (NLP) and research into natural language syntax and grammar, especially within linguistic typology. The annotation scheme has its roots in three related projects: Stanford Dependencies, the universal part-of-speech tag set of Petrov et al., and the Interset interlingua. The UD annotation scheme uses a representation in the form of dependency trees as opposed to phrase structure trees. As of February 2019, there are just over 100 treebanks of more than 70 languages available in the UD inventory. The project's primary aim is to achieve cross-linguistic consistency of annotation. However, language-specific extensions are permitted for morphological features (individual languages or resources can introduce additional features). In a more restricted form, dependency relations can be extended with a secondary label that accompanies the UD label, e.g., aux:pass for an auxiliary (UD aux) used to mark passive voice. The Universal Dependencies have inspired similar efforts for the areas of inflectional morphology, frame semantics and coreference. For phrase structure syntax, a comparable effort does not seem to exist, but the specifications of the Penn Treebank have been applied to (and extended for) a broad range of languages, e.g., Icelandic, Old English, Middle English, Middle Low German, Early Modern High German, Yiddish, Portuguese, Japanese, Arabic and Chinese. 
=== Conventions for interlinear glosses === In linguistics, an interlinear gloss is a gloss (series of brief explanations, such as definitions or pronunciations) placed between lines (inter- + linear), such as between a line of original text and its translation into another language. When glossed, each line of the original text acquires one or more lines of transcription known as an interlinear text or interlinear glossed text (IGT)—interlinear for short. Such glosses help the reader follow the relationship between the source text and its translation, and the structure of the original language. There is no standard inventory for glosses, but common labels are collected in the Leipzig Glossing Rules. Wikipedia also provides a List of glossing abbreviations that draws on this and other sources. === General Ontology for Linguistic Description (GOLD) === GOLD ("General Ontology for Linguistic Description") is an ontology for descriptive linguistics. It gives a formalized account of the most basic categories and relations used in the scientific description of human language, e.g., as a formalization of interlinear glosses. GOLD was first introduced by Farrar and Langendoen (2003). Originally, it was envisioned as a solution to the problem of resolving disparate markup schemes for linguistic data, in particular data from endangered languages. However, GOLD is much more general and can be applied to all languages. In this function, GOLD overlaps with the ISO 12620 Data Category Registry (ISOcat); it is, however, more stringently structured. GOLD was maintained by the LINGUIST List and others from 2007 to 2010. The RELISH project created a mirro

  • DreamLab

    DreamLab was a volunteer computing Android and iOS app launched in 2015 by Imperial College London and the Vodafone Foundation. It was discontinued on 2nd April 2025. == Description == The app helped to research cancer, COVID-19, new drugs and tropical cyclones. To do this, DreamLab accessed part of the device's processing power, with the user's consent, while the owner charged their smartphone, to speed up the calculations of the algorithms from Imperial College London. The aim of the tropical cyclone project was to prepare for climate change risks. Other projects aimed to find existing drugs and food molecules that could help people with COVID-19 and other diseases. The performance of 100,000 smartphones would reach the annual output of all research computers at Imperial College in just three months, with a nightly runtime of six hours. The app was developed in 2015 by the Garvan Institute of Medical Research in Sydney and the Vodafone Foundation. In May 2020, the project had over 490,000 registered users.

  • Microsoft SQL Server Master Data Services

    Microsoft SQL Server Master Data Services (MDS) is a Master Data Management (MDM) product from Microsoft that ships as a part of the Microsoft SQL Server relational database management system. Master data management (MDM) allows an organization to discover and define non-transactional lists of data, and compile maintainable, reliable master lists. Master Data Services first shipped with Microsoft SQL Server 2008 R2. Microsoft SQL Server 2016 introduced enhancements to Master Data Services, such as improved performance and security, and the ability to clear transaction logs, create custom indexes, share entity data between different models, and support for many-to-many relationships. == Overview == In Master Data Services, the model is the highest level container in the structure of your master data. You create a model to manage groups of similar data. A model contains one or more entities, and entities contain members that are the data records. An entity is similar to a table. Like other MDM products, Master Data Services aims to create a centralized data source and keep it synchronized, and thus reduce redundancies, across the applications which process the data. Sharing the architectural core with Stratature +EDM, Master Data Services uses a Microsoft SQL Server database as the physical data store. It is a part of the Master Data Hub, which uses the database to store and manage data entities. It is a database with the software to validate and manage the data, and keep it synchronized with the systems that use the data. The master data hub has to extract the data from the source system, validate, sanitize and shape the data, remove duplicates, and update the hub repositories, as well as synchronize the external sources. The entity schemas, attributes, data hierarchies, validation rules and access control information are specified as metadata to the Master Data Services runtime. Master Data Services does not impose any limitation on the data model. 
Master Data Services also allows custom business rules to be defined for validating and sanitizing the data entering the data hub; the rules are then run against the data matching the specified criteria. All changes made to the data are validated against the rules, and a log of the transaction is stored persistently. Violations are logged separately, and optionally the owner is notified automatically. All the data entities can be versioned. Master Data Services allows the master data to be categorized by hierarchical relationships; for example, employee data are a subtype of organization data. Hierarchies are generated by relating data attributes. Data can be automatically categorized using rules, and the categories are introspected programmatically. Master Data Services can also expose the data as Microsoft SQL Server views, which can be pulled by any SQL-compatible client. It uses a role-based access control system to restrict access to the data. The views are generated dynamically, so they contain the latest data entities in the master hub. It can also push out the data by writing to some external journals. Master Data Services also includes a web-based UI for viewing and managing the data. It uses ASP.NET in the back-end. The Silverlight front-end was replaced with HTML5 in SQL Server 2019. Master Data Services provides a Web service interface to expose the data, as well as an API that internally uses the exposed web services and exposes the feature set programmatically to access and manipulate the data. It also integrates with Active Directory for authentication purposes. Unlike +EDM, Master Data Services supports Unicode characters as well as multilingual user interfaces. SQL Server 2016 introduced a significant performance increase in Master Data Services over previous versions. == Terminology == Model is the highest level of an MDS instance. It is the primary container for specific groupings of master data. 
In many ways it is very similar to the idea of a database. Entities are containers created within a model. Entities provide a home for members, and are in many ways analogous to database tables. (e.g. Customer) Members are analogous to the records in a database table (Entity) e.g. Will Smith. Members are contained within entities. Each member is made up of two or more attributes. Attributes are analogous to the columns within a database table (Entity) e.g. Surname. Attributes exist within entities and help describe members (the records within the table). Name and Code attributes are created by default for each entity and serve to describe and uniquely identify leaf members. Attributes can be related to other attributes from other entities which are called 'domain-based' attributes. This is similar to the concept of a foreign key. Other attributes however, will be of type 'free-form' (most common) or 'file'. Attribute Groups are explicitly defined collections of particular attributes. Say you have an entity "customer" that has 50 attributes — too much information for many of your users. Attribute groups enable the creation of custom sets of hand-picked attributes that are relevant for specific audiences. (e.g. "customer - delivery details" that would include just their name and last known delivery address). This is very similar to a database view. Hierarchies organize members into either Derived or Explicit hierarchical structures. Derived hierarchies, as the name suggests, are derived by the MDS engine based on the relationships that exist between attributes. Explicit hierarchies are created by hand using both leaf and consolidated members. Business Rules can be created and applied against model data to ensure that custom business logic is adhered to. In order to be committed into the system data must pass all business rule validations applied to them. e.g. 
Within the Customer Entity you may want to create a business rule that ensures all members of the 'Country' Attribute contain either the text "USA" or "Canada". Once created and run, the Business Rule verifies that all the data is correct before accepting it into the approved model. Versions provide system owners / administrators with the ability to Open, Lock or Commit a particular version of a model and the data contained within it at a particular point in time. As the content within a model varies, grows or shrinks over time, versions provide a way of managing metadata so that subscribing systems can access the correct content.
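The terminology above (entities holding members, members described by attributes, business rules gating what is accepted) can be illustrated with a small Python sketch. This is not the Master Data Services API: the class names `Member`, `Entity` and `BusinessRule`, the `validate` helper, and the second sample customer are all hypothetical, chosen only to mirror the 'Country' rule described in the excerpt.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Member:
    """One record in an entity; Code and Name mirror the default attributes."""
    code: str
    name: str
    attributes: dict[str, str] = field(default_factory=dict)


@dataclass
class Entity:
    """A container of members, analogous to a database table."""
    name: str
    members: list[Member] = field(default_factory=list)


@dataclass
class BusinessRule:
    """A named predicate that every member must satisfy (hypothetical shape)."""
    description: str
    check: Callable[[Member], bool]


def validate(entity: Entity, rules: list[BusinessRule]) -> dict[str, list[str]]:
    """Map each failing member's code to the descriptions of the rules it broke."""
    violations: dict[str, list[str]] = {}
    for member in entity.members:
        failed = [rule.description for rule in rules if not rule.check(member)]
        if failed:
            violations[member.code] = failed
    return violations


# The rule from the text: Country must be either "USA" or "Canada".
country_rule = BusinessRule(
    "Country must be USA or Canada",
    lambda m: m.attributes.get("Country") in {"USA", "Canada"},
)

customer = Entity("Customer", [
    Member("C001", "Will Smith", {"Country": "USA"}),
    Member("C002", "Jane Doe", {"Country": "France"}),  # invented sample record
])

violations = validate(customer, [country_rule])
```

In this sketch only members with no violations would be accepted into the approved model; how a real deployment logs violations or notifies owners is outside what the example tries to show.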

    Read more →
  • Knowledge graph

    Knowledge graph

    In knowledge representation and reasoning, a knowledge graph is a knowledge base that uses a graph-structured data model or topology to represent and operate on data. Knowledge graphs are often used to store interlinked descriptions of entities – objects, events, situations or abstract concepts – while also encoding the free-form semantics or relationships underlying these entities. Since the development of the Semantic Web, knowledge graphs have often been associated with linked open data projects, focusing on the connections between concepts and entities. They are also historically associated with and used by search engines such as Google, Bing, and Yahoo; knowledge engines and question-answering services such as WolframAlpha, Apple's Siri, and Amazon Alexa; and social networks such as LinkedIn and Facebook. Recent developments in data science and machine learning, particularly in graph neural networks, representation learning, and machine learning, have broadened the scope of knowledge graphs beyond their traditional use in search engines and recommender systems. They are increasingly used in scientific research, with notable applications in fields such as genomics, proteomics, and systems biology. == History == The term was coined as early as 1972 by the Austrian linguist Edgar W. Schneider, in a discussion of how to build modular instructional systems for courses. In the late 1980s, the University of Groningen and University of Twente jointly began a project called Knowledge Graphs, focusing on the design of semantic networks with edges restricted to a limited set of relations, to facilitate algebras on the graph. In subsequent decades, the distinction between semantic networks and knowledge graphs was blurred. Some early knowledge graphs were topic-specific. In 1985, Wordnet was founded, capturing semantic relationships between words and meanings – an application of this idea to language itself. 
In 2005, Marc Wirk founded Geonames to capture relationships between different geographic names and locales and associated entities. In 1998, Andrew Edmonds of Science in Finance Ltd in the UK created a system called ThinkBase that offered fuzzy-logic based reasoning in a graphical context. In 2007, both DBpedia and Freebase were founded as graph-based knowledge repositories for general-purpose knowledge. DBpedia focused exclusively on data extracted from Wikipedia, while Freebase also included a range of public datasets. Neither described themselves as a 'knowledge graph' but developed and described related concepts. In 2012, Google introduced their Knowledge Graph, building on DBpedia and Freebase among other sources. They later incorporated RDFa, Microdata, JSON-LD content extracted from indexed web pages, including the CIA World Factbook, Wikidata, and Wikipedia. Entity and relationship types associated with this knowledge graph have been further organized using terms from the schema.org vocabulary. The Google Knowledge Graph became a complement to string-based search within Google, and its popularity online brought the term into more common use. Since then, several large multinationals have advertised their use of knowledge graphs, further popularising the term. These include Facebook, LinkedIn, Airbnb, Microsoft, Amazon, Uber and eBay. In 2019, IEEE combined its annual international conferences on "Big Knowledge" and "Data Mining and Intelligent Computing" into the International Conference on Knowledge Graph. The development of large language models expanded interest in knowledge graphs as a way to structure information from unstructured text, with advances in language processing enabling their automatic or semi-automatic generation and expansion. The term knowledge graph has since broadened to include the dynamically constructed and adaptive graph structures, which support retrieval, reasoning, and summarization in generative systems. 
Microsoft Research's GraphRAG (2024) exemplified this development by integrating LLM-generated graphs into retrieval-augmented generation. == Definitions == There is no single commonly accepted definition of a knowledge graph. Most definitions view the topic through a Semantic Web lens and include these features: Flexible relations among knowledge in topical domains: A knowledge graph (i) defines abstract classes and relations of entities in a schema, (ii) mainly describes real world entities and their interrelations, organized in a graph, (iii) allows for potentially interrelating arbitrary entities with each other, and (iv) covers various topical domains. General structure: A network of entities, their semantic types, properties, and relationships. To represent properties, categorical or numerical values are often used. Supporting reasoning over inferred ontologies: A knowledge graph acquires and integrates information into an ontology and applies a reasoner to derive new knowledge. There are, however, many knowledge graph representations for which some of these features are not relevant. For those knowledge graphs, this simpler definition may be more useful: A digital structure that represents knowledge as concepts and the relationships between them (facts). A knowledge graph can include an ontology that allows both humans and machines to understand and reason about its contents. === Implementations === In addition to the above examples, the term has been used to describe open knowledge projects such as YAGO and Wikidata; federations like the Linked Open Data cloud; a range of commercial search tools, including Yahoo's semantic search assistant Spark, Google's Knowledge Graph, and Microsoft's Satori; and the LinkedIn and Facebook entity graphs. The term is also used in the context of note-taking software applications that allow a user to build a personal knowledge graph. 
The popularization of knowledge graphs and their accompanying methods have led to the development of graph databases such as Neo4j, GraphDB and AgensGraph. These graph databases allow users to easily store data as entities and their interrelationships, and facilitate operations such as data reasoning, node embedding, and ontology development on knowledge bases. In contrast, virtual knowledge graphs do not store information in specialized databases. They rely on an underlying relational database or data lake to answer queries on the graph. Such a virtual knowledge graph system must be properly configured in order to answer the queries correctly. This specific configuration is done through a set of mappings that define the relationship between the elements of the data source and the structure and ontology of the virtual knowledge graph. == Using a knowledge graph for reasoning over data == A knowledge graph formally represents semantics by describing entities and their relationships. Knowledge graphs may make use of ontologies as a schema layer. By doing this, they allow logical inference for retrieving implicit knowledge rather than only allowing queries requesting explicit knowledge. In order to allow the use of knowledge graphs in various machine learning tasks, several methods for deriving latent feature representations of entities and relations have been devised. These knowledge graph embeddings allow them to be connected to machine learning methods that require feature vectors like word embeddings. This can complement other estimates of conceptual similarity. Models for generating useful knowledge graph embeddings are commonly the domain of graph neural networks (GNNs). GNNs are deep learning architectures that comprise edges and nodes, which correspond well to the entities and relationships of knowledge graphs. 
The topology and data structures afforded by GNNs provide a convenient domain for semi-supervised learning, wherein the network is trained to predict the value of a node embedding (provided a group of adjacent nodes and their edges) or edge (provided a pair of nodes). These tasks serve as fundamental abstractions for more complex tasks such as knowledge graph reasoning and alignment. === Entity alignment === As new knowledge graphs are produced across a variety of fields and contexts, the same entity will inevitably be represented in multiple graphs. However, because no single standard for the construction or representation of knowledge graph exists, resolving which entities from disparate graphs correspond to the same real world subject is a non-trivial task. This task is known as knowledge graph entity alignment, and is an active area of research. Strategies for entity alignment generally seek to identify similar substructures, semantic relationships, shared attributes, or combinations of all three between two distinct knowledge graphs. Entity alignment methods use these structural similarities between generally non-isomorphic graphs to predict which nodes correspond to the same entity. In 2023, researchers found success in using large language models (LLMs) in the task of entity alignment. This was in particul

    Read more →
  • Golden record (informatics)

    Golden record (informatics)

    In informatics, a golden record is the valid version of a data element (record) in a single source of truth system. It may refer to a database, specific table or data field, or any unit of information used. A golden copy is a consolidated data set, and is supposed to provide a single source of truth and a "well-defined version of all the data entities in an organizational ecosystem". Other names sometimes used include master source or master version. The term has been used in conjunction with data quality, master data management, and similar topics. (Different technical solutions exist, see master data management). == Master data == In master data management (MDM), the golden copy refers to the master data (master version) of the reference data which works as an authoritative source for the "truth" for all applications in a given IT landscape.

    Read more →
  • Meta AI

    Meta AI

    Meta AI is a research division of Meta (formerly Facebook) that develops artificial intelligence and augmented reality technologies. == History == Meta AI was founded in 2013 as Facebook Artificial Intelligence Research (FAIR). It has workspaces in Menlo Park, London, New York City, Paris, Seattle, Pittsburgh, Tel Aviv, and Montreal as of 2025. In 2016, FAIR partnered with Google, Amazon, IBM, and Microsoft in creating the Partnership on Artificial Intelligence to Benefit People and Society. Meta AI was directed by Yann LeCun until 2018, when Jérôme Pesenti succeeded him in the role. Pesenti was formerly the CTO of IBM's big data group. FAIR's research includes self-supervised learning, generative adversarial networks, document classification and translation, and computer vision. FAIR released Torch deep-learning modules as well as PyTorch, an open-source machine learning framework, in 2017; PyTorch was subsequently used in several deep learning technologies, such as Tesla's autopilot and Uber's Pyro. That same year, a pair of chatbots were falsely rumored to have been discontinued for developing a language that was unintelligible to humans. FAIR clarified that the research had been shut down because they had accomplished their initial goal of understanding how languages are generated by their models, rather than out of fear. FAIR was renamed Meta AI following the rebranding that changed Facebook, Inc. to Meta Platforms Inc. On October 1, 2025, Facebook announced: "We will soon use your interactions with AI at Meta to personalize the content and ads you see". == Virtual assistant == Meta AI is also the name of the virtual assistant developed by the team, now integrated as a chatbot into Meta's social networking products. It is also available as a subscription-based stand-alone app. The virtual assistant was pre-installed on the second generation of Ray-Ban Meta smartglasses, and can incorporate inputs from the glasses' cameras after an update. 
It is also available on Quest 2 and newer HMDs. Since May 2024, the chatbot has summarized news from various outlets without linking directly to original articles, including in Canada, where news links are banned on its platforms. This use of news content without compensation and attribution has raised ethical and legal concerns, especially as Meta continues to reduce news visibility on its platforms. == Current research == === Natural language processing and chatbot === Natural language processing is the ability of machines to understand and generate natural language. The team is also researching unsupervised machine translation and multilingual chatbots. ==== Galactica ==== Galactica is a large language model (LLM) designed for generating scientific text. It was available for three days from 15 November 2022, before being withdrawn for generating racist and inaccurate content. ==== Llama ==== Llama is an LLM released in February 2023. As of January 2026, the most recent release is Llama 4. === Hardware === Meta used CPUs and in-house custom chips before 2022 and has since switched to Nvidia GPUs. MTIA v1, one of its early chips, is designed for the company's content recommendation algorithms. It was fabricated on TSMC's 7 nm process technology, consumed 25 W, and was capable of 51.2 TFLOPS at FP16. == Controversy == The French media outlet Mediapart reports that in 2022, Facebook's parent company illegally used works accumulated by the pirate site LibGen to train its artificial intelligence.

    Read more →
  • Artificial intelligence industry in Italy

    Artificial intelligence industry in Italy

    The artificial intelligence industry in Italy is growing and supports industrial development. In 2024 it reached a new record of 1.2 billion euros, up 58% from 2023. In 2025, the growth of artificial intelligence in industrial applications was even greater than in 2024, both in terms of value and in its application to industrial sectors. == History == The roots of AI research in Italy extend back to the 1970s, when Italian scholars began exploring automated reasoning, programming language semantics, and pattern recognition. Researchers involved in early projects at the National Research Council and various universities laid the groundwork for subsequent academic and industrial developments in the field. During this period, the focus was predominantly on developing algorithms for automated theorem proving and building systems to reason about complex mathematical problems. This era witnessed the birth of methodologies that would later influence numerous AI subfields, from natural language processing (NLP) to robotics. === Institutional milestones and academic contributions === A turning point in the Italian AI landscape was the formation of the Italian Association for Artificial Intelligence (AIxIA) in 1988. Founded by academics, including Luigia Carlucci Aiello, the association established a platform for collaboration between universities, research centers, and industry. Led by Aiello, AIxIA played a role in promoting research, organizing national conferences, and fostering international partnerships that connected Italy's AI community to global networks. At the same time, professors such as Roberto Navigli and numerous practitioners contributed to the advancement of AI in Italy. Navigli has worked in multilingual NLP, including the creation of BabelNet, and led the Minerva project. 
=== Industrial AI === Over recent decades, numerous national and European initiatives supported by funding from programs such as the National Recovery and Resilience Plan (PNRR) have spurred the transition from theoretical research to practical applications. Industrial sectors including manufacturing, banking, and healthcare increasingly embraced AI-driven automation, while research institutions collaborated with industrial partners to deploy cutting-edge solutions. In recent years, Italy has also seen the establishment of specialized research centers and institutes aimed at bridging the gap between academic innovation and industrial application. These initiatives indicate a broader national commitment to integrating AI into the fabric of Italian industry. == Recent developments == === Emergence of generative AI === A landmark in Italy's modern AI evolution is the development of Minerva AI. Developed by the Sapienza NLP research group at Sapienza University of Rome and led by Professor Roberto Navigli, Minerva represents the first family of large language models (LLMs) trained from scratch with a primary focus on the Italian language. ==== Minerva 7B ==== The latest iteration, Minerva 7B, has 7 billion parameters and has been trained on an extensive corpus of over 1.5 trillion words. By using advanced instruction tuning techniques, Minerva 7B is able to produce highly accurate, coherent, and contextually sensitive responses, addressing common issues such as hallucinations and inappropriate content generation. This breakthrough sets a benchmark for transparent, open-source AI development in the country. Minerva's development, carried out within the FAIR (Future Artificial Intelligence Research) project in collaboration with CINECA and supported by supercomputing resources such as the Leonardo supercomputer, aligns closely with Italy's cultural and linguistic heritage. 
=== Establishment of AI4I === The recent establishment of the Istituto Italiano per l’Intelligenza Artificiale (AI4I) is part of Italy's strategy to improve its industrial competitiveness in AI. This dedicated institute aims to bridge the gap between research institutions and industrial enterprises; promote training and R&D support to nurture the next generation of Italian AI experts; and enhance national competitiveness. This initiative is expected to serve as a hub for applied AI research, driving innovations that are tailored to the specific needs of Italian industry and public administration. === Benefits of InvestAI === Italy's AI industry stands to benefit from the European InvestAI initiative, a plan unveiled at the recent AI Action Summit in Paris. InvestAI is an effort by the European Commission to mobilize €200 billion for AI investments, with a dedicated €20 billion fund earmarked for building AI gigafactories. These gigafactories are planned as large-scale hubs for training advanced, complex AI models using approximately 100,000 latest-generation AI chips. For Italy, this investment presents several major opportunities: Access to State-of-the-Art Infrastructure: Italian companies, research institutions, and start-ups can leverage the gigafactories’ immense computational resources, enabling them to train highly sophisticated language models and other AI systems. Enhanced Competitiveness and Collaboration: With InvestAI's layered funding model, where EU funds help de-risk private investments, Italian firms can access capital more readily. This will bolster public–private partnerships and create a more dynamic AI ecosystem that spans from academic research to industrial applications. Alignment with National and Regional Initiatives: The Istituto Italiano per l’Intelligenza Artificiale (AI4I), based in Turin, is already recognized as a strategic asset by both Italy and the European Union. 
As the main recipient of InvestAI funds in Italy, AI4I will play a pivotal role in implementing these investments locally, fostering innovation in sectors like manufacturing, healthcare and aerospace. Commission President Ursula von der Leyen emphasized that InvestAI is designed to democratize AI innovation throughout Europe by ensuring that even smaller companies have access to high-performance computing power. For Italy, this means not only keeping pace with global leaders but also harnessing European-scale investments to transform its AI industry and drive economic growth.

    Read more →
  • Tertiary source

    Tertiary source

    A tertiary source is an index or textual consolidation of already published primary and secondary sources that does not provide additional interpretations or analysis of the sources. Some tertiary sources can be used as an aid to find key (seminal) sources, key terms, general common knowledge and established mainstream science on a topic. The exact definition of tertiary varies by academic field. Academic research standards generally do not accept tertiary sources such as encyclopedias as citations, although survey articles are frequently cited rather than the original publication. == Overlap with secondary sources == As is also the case with distinguishing primary and secondary sources in some disciplines, there is not always a clear distinguishing line between secondary and tertiary sources. Depending on the topic of research, a scholar may use a bibliography, dictionary, or encyclopedia as either a tertiary or a secondary source. This causes some difficulty in defining many sources as either one type or the other. In some academic disciplines, the differentiation between a secondary and tertiary source is relative. In the United Nations International Scientific Information System (UNISIST) model, a secondary source is a bibliography, whereas a tertiary source is a synthesis of primary sources. == Types of tertiary sources == Tertiary sources can come in book form or as an online resource. Tertiary sources in book form are frequently organised in alphabetical order, whereas an online tertiary source may be searchable by keyword. Examples of tertiary sources include reference books, encyclopedias, dictionaries, some textbooks, abstracts, directories, factbooks, handbooks, manuals and compendia. Indexes, bibliographies, concordances, and databases are aggregates of primary and secondary sources and are therefore often considered tertiary sources. They may also serve as a point of access to the full or partial text of primary and secondary sources. 
Almanacs, travel guides, field guides, and timelines are also examples of tertiary sources. Tertiary sources attempt to summarize, collect, and consolidate the source materials into an overview without adding analysis and synthesis of new conclusions. Wikipedia is a tertiary source.

    Read more →
  • Penril

    Penril

    Penril DataComm Networks, Inc. was a computer telecommunications hardware company that made some acquisitions and was eventually split into two parts: one was acquired by Bay Networks and the other was a newly formed company named Access Beyond. The focus of both companies' products was end-to-end data transfer. By the mid-1990s, with the popularization of the internet, this was no longer of wide interest. == History == Penril, whose earnings reports and other financials were followed by The New York Times in the 1990s, made several acquisitions but also grew internally. Following its Datability acquisition it renamed itself Penril Datability Networks. By the time Penril, founded in 1968, was acquired by Bay, its name was Penril DataComm Networks. The company, which as of 1985 "had made 14 acquisitions in 12 years," also had done extensive work regarding quality control, and leveraged its product line by what The Washington Post called clever packaging: "software, cables, instructions and telephone support" sold to those less technically skilled as "Network in a Box." == Datability == Datability Software Systems Inc. was the initial name of what by 1991 became 'Datability, Inc.', "a manufacturer of hardware that links computer networks." Founded in 1977, the firm began as a software consulting company, especially in the area of databases. To speed up project development they built a program generator, which they marketed as Control 10/20 (targeted at users of Digital Equipment Corporation's DECsystem-10 and DECSYSTEM-20). After trying their hand at time-sharing they built hardware to enhance bridging these computers to DEC's VAX product line. In particular they focused on Digital's LAT protocol, selling "boxes" that reimplemented the protocol at a lower price than DEC's. They later expanded into other areas of telecommunications hardware. The firm relocated to a larger manufacturing plant in 1991 and was acquired by Penril in 1993. 
== Access Beyond == Access Beyond was initially housed by Penril, from which it was spun off. A securities analyst noted that Access began operations with no debt. They subsequently merged with Hayes Corporation. Some of the funds brought to the merger came from a sale by Penril of two of its divisions, each bringing about $4 million. == Ron Howard == Ron Howard, founder of Datability, became part of Penril when the latter acquired the former, and was CEO of Access Beyond when it was spun off by Penril. Access merged with Hayes Microcomputer Products and was renamed Hayes Corp, at which time Howard became executive VP of business development and corporate vice chairman of Hayes. == People == Penril hired immigrants as assembly-line workers in an industry where recent arrivals came from a culture of six-day work weeks and subcontracting was then common; these workers made up about 25% at Penril, compared with double that share at other firms. Placement was overseen by government agencies. == Controversy == Penril had a joint development agreement, beginning in 1990, with a Standard Microsystems Corporation (SMSC) subsidiary. A dispute arose, and the matter was brought to court. Penril was awarded $3.5 million in 1996.

    Read more →
  • Resolution enhancement technology

    Resolution enhancement technology

    Resolution enhancement technology (RET) is a form of image processing technology used to manipulate dot characteristics popular among laser printer and inkjet printer manufacturers. Closely related RET techniques are also used in VLSI photolithography manufacturing technology, in particular in relation to 90 nanometre technology. Resolution refers to the sharpness of image detail, smoothness of curved lines, and the faithful reproduction of an image. In both cases, RET uses pre-compensation of the image in order to try to mitigate the effects of the printing process. Among the major issues in RET in VLSI technology are the fundamental properties of a wave: amplitude, phase, and direction.

    Read more →
  • Conceptions of Library and Information Science

    Conceptions of Library and Information Science

    Conceptions of Library and Information Science (CoLIS) is a series of conferences about historical, empirical and theoretical perspectives in Library and Information Science. == CoLIS conferences == CoLIS 1: 1991 in Tampere, Finland; CoLIS 2: 1996 in Copenhagen, Denmark; CoLIS 3: 1999 in Dubrovnik, Croatia; CoLIS 4: 2002 in Seattle, US; CoLIS 5: 2005 in Glasgow, Scotland; CoLIS 6: 2007 in Borås, Sweden; CoLIS 7: June 2010 in London, at City University London; CoLIS 8: August 19–22, 2013, in Copenhagen, Denmark, at The Royal School of Library and Information Science; CoLIS 9: June 27–29, 2016, in Uppsala, Sweden, at Uppsala University; CoLIS 10: June 16–19, 2019, in Ljubljana, Slovenia, Faculty of Arts; CoLIS 11: May 29–June 1, 2022, in Oslo, Norway, Oslo Metropolitan University.

    Read more →
  • Lancichinetti–Fortunato–Radicchi benchmark

    Lancichinetti–Fortunato–Radicchi benchmark

    Lancichinetti–Fortunato–Radicchi benchmark is an algorithm that generates benchmark networks (artificial networks that resemble real-world networks). They have a priori known communities and are used to compare different community detection methods. The advantage of the benchmark over other methods is that it accounts for the heterogeneity in the distributions of node degrees and of community sizes. == The algorithm == The benchmark assumes that both the node degrees and the community sizes follow power law distributions with different exponents, $\gamma$ and $\beta$, respectively. $N$ is the number of nodes and the average degree is $\langle k\rangle$. There is a mixing parameter $\mu$, which is the average fraction of neighboring nodes of a node that do not belong to any community that the node belongs to. This parameter controls the fraction of edges that are between communities. Thus, it reflects the amount of noise in the network. At the extremes, when $\mu = 0$ all links are within-community links, and when $\mu = 1$ all links are between nodes belonging to different communities. One can generate the benchmark network using the following steps. Step 1: Generate a network with node degrees following a power law distribution with exponent $\gamma$, and choose the extremes of the distribution, $k_{\min}$ and $k_{\max}$, to get the desired average degree $\langle k\rangle$. Step 2: A fraction $(1-\mu)$ of the links of every node is with nodes of the same community, while a fraction $\mu$ is with the other nodes. Step 3: Generate community sizes from a power law distribution with exponent $\beta$. 
The sum of all sizes must be equal to N {\displaystyle N} . The minimal and maximal community sizes s min {\displaystyle s_{\min }} and s max {\displaystyle s_{\max }} must satisfy the definition of community so that every non-isolated node is in at least in one community: s min > k min {\displaystyle s_{\min }>k_{\min }} s max > k max {\displaystyle s_{\max }>k_{\max }} Step 4: Initially, no nodes are assigned to communities. Then, each node is randomly assigned to a community. As long as the number of neighboring nodes within the community does not exceed the community size a new node is added to the community, otherwise stays out. In the following iterations the “homeless” node is randomly assigned to some community. If that community is complete, i.e. the size is exhausted, a randomly selected node of that community must be unlinked. Stop the iteration when all the communities are complete and all the nodes belong to at least one community. Step 5: Implement rewiring of nodes keeping the same node degrees but only affecting the fraction of internal and external links such that the number of links outside the community for each node is approximately equal to the mixing parameter μ {\displaystyle \mu } . == Testing == Consider a partition into communities that do not overlap. The communities of randomly chosen nodes in each iteration follow a p ( C ) {\displaystyle p(C)} distribution that represents the probability that a randomly picked node is from the community C {\displaystyle C} . Consider a partition of the same network that was predicted by some community finding algorithm and has p ( C 2 ) {\displaystyle p(C_{2})} distribution. The benchmark partition has p ( C 1 ) {\displaystyle p(C_{1})} distribution. The joint distribution is p ( C 1 , C 2 ) {\displaystyle p(C_{1},C_{2})} . The similarity of these two partitions is captured by the normalized mutual information. 
I n = ∑ C 1 , C 2 p ( C 1 , C 2 ) log 2 ⁡ p ( C 1 , C 2 ) p ( C 1 ) p ( C 2 ) 1 2 H ( { p ( C 1 ) } ) + 1 2 H ( { p ( C 2 ) } ) {\displaystyle I_{n}={\frac {\sum _{C_{1},C_{2}}p(C_{1},C_{2})\log _{2}{\frac {p(C_{1},C_{2})}{p(C_{1})p(C_{2})}}}{{\frac {1}{2}}H(\{p(C_{1})\})+{\frac {1}{2}}H(\{p(C_{2})\})}}} If I n = 1 {\displaystyle I_{n}=1} the benchmark and the detected partitions are identical, and if I n = 0 {\displaystyle I_{n}=0} then they are independent of each other.
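The normalized mutual information above can be computed directly from two lists of per-node community labels. The following is a minimal Python sketch, not a reference implementation: the function names `entropy` and `normalized_mutual_information` are hypothetical, the distributions p(C_1), p(C_2) and p(C_1, C_2) are estimated as simple label frequencies, and returning 1.0 when both entropies are zero is a convention assumed here rather than something the text specifies.

```python
from collections import Counter
from math import log2


def entropy(probabilities):
    """Shannon entropy (base 2) of a discrete distribution."""
    return -sum(p * log2(p) for p in probabilities if p > 0)


def normalized_mutual_information(labels_1, labels_2):
    """I_n for two non-overlapping partitions given as per-node label lists."""
    if len(labels_1) != len(labels_2):
        raise ValueError("both partitions must label the same nodes")
    n = len(labels_1)
    if n == 0:
        raise ValueError("partitions must not be empty")

    # Empirical p(C1), p(C2) and joint p(C1, C2) from label frequencies.
    p1 = {c: k / n for c, k in Counter(labels_1).items()}
    p2 = {c: k / n for c, k in Counter(labels_2).items()}
    p12 = {pair: k / n for pair, k in Counter(zip(labels_1, labels_2)).items()}

    # Numerator: mutual information of the two partitions.
    mutual_information = sum(
        p * log2(p / (p1[c1] * p2[c2])) for (c1, c2), p in p12.items()
    )
    # Denominator: average of the two partition entropies.
    normaliser = 0.5 * entropy(p1.values()) + 0.5 * entropy(p2.values())
    if normaliser == 0:
        # Assumed convention: both partitions are a single community,
        # so they are treated as identical.
        return 1.0
    return mutual_information / normaliser


benchmark = [0, 0, 1, 1]    # planted communities of four nodes
relabelled = [7, 7, 3, 3]   # same grouping under different label names
unrelated = [0, 1, 0, 1]    # grouping statistically independent of the benchmark

print(normalized_mutual_information(benchmark, relabelled))
print(normalized_mutual_information(benchmark, unrelated))
```

In this toy case the relabelled partition scores 1 and the independent one scores 0, matching the two extremes described above. Because only the co-occurrence of labels matters, the label names themselves are irrelevant.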

    Read more →