AI Analytics Dashboard

AI Analytics Dashboard — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Wilkinson's Grammar of Graphics

    Wilkinson's Grammar of Graphics

    The Grammar of Graphics (GoG) is a grammar-based system for representing graphics to provide grammatical constraints on the composition of data and information visualizations. A graphical grammar differs from a graphics pipeline as it focuses on semantic components such as scales and guides, statistical functions, coordinate systems, marks and aesthetic attributes. For example, a bar chart can be converted into a pie chart by specifying a polar coordinate system without any other change in graphical specification. The grammar of graphics concept was launched by Leland Wilkinson in 2001 (Wilkinson et al., 2001; Wilkinson, 2005) and graphical grammars have since been written in a variety of languages with various parameterisations and extensions. The major implementations of graphical grammars are nViZn created by a team at SPSS/IBM, followed by Polaris focusing on multidimensional relational databases which is commercialised as Tableau, a revised Layered Grammar of Graphics by Hadley Wickham in Ggplot2, and Vega-Lite which is a visualisation grammar with added interactivity. The grammar of graphics continues to evolve with alternate parameterisations, extensions, or new specifications. == Wilkinson's Grammar of Graphics == === Theory === Wilkinson conceived the seven elements of a graphics to be Variables: mapping of objects to values represented in a graphic Algebra: operations to combine variables and specify dimensions of graphs Geometry: creation of geometric graphs from variables Aesthetics: sensory attributes Statistics: functions to change the appearance and representation of graphs Scales: represent variables on measured dimensions Coordinates: mapping to coordinate systems With these, Wilkinson hypothesised that These seven constructs are orthogonal and virtually all known statistical charts can be generated relatively parsimoniously This computational system is not a taxonomy of charts and rather it describes the meaning of what we do when we construct statistical graphics. === Implementations === Wilkinson wrote SYSTAT, a statistical software package, in the early 1980s. This program was noted for its comprehensive graphics, including the first software implementation of the heatmap display now widely used among biologists. After his company grew to 50 employees, he sold it to SPSS in 1995. At SPSS, he assembled a team of graphics programmers who developed the nViZn platform that produces the visualizations in SPSS, Clementine, and other analytics products. While at Stanford, Tableau founders Hanrahan and Stolte, as well as Diane Tang, created the predecessor to Tableau, named Polaris. Polaris was a data visualization software tool, built with the support of a United States Department of Energy defense program, the Accelerated Strategic Computing Initiative (ASCI). The main differences between Wilkinson's system and Polaris are the use of SQL relational algebra for database services and using shelves instead of cross and nest operators. == Wickham's Layered Grammar of Graphics == === Theory === Hadley Wickham conceived an alternate parameterisation of the syntax Wilkinson had derived, creating a layered grammar of graphics which he implemented as ggplot2 for R (programming language) users. This added a hierarchy of defaults based around the idea of building up a graphic from multiple layers. Wickham conceived these elements to be: Defaults: consists of data and mapping Data: dataset Mapping: aesthetic mappings Layer: consists of data, mapping, geom, stat, and position Data: dataset, or inherit from defaults Mapping: aesthetic mappings, or inherit from defaults Geom: geometric object Stat: statistical transformation Position: position adjustment Scale: mapping of data to aesthetic attributes Coord: mapping of data to the plane of the plot Facet: split up the data === Reception === Wilkinson is generally positive on Wickham's parameterisation and implementation of ggplot2, praising its elegance and expressivity whilst claiming that his original Grammar of Graphics is capable of representing a wider range of statistical graphics. === Implementations === ggplot2 is the first implementation of a layered grammar of graphics in R and implementations in other programming languages have ensued. These include direct ports plotnine for Python, gramm for MATLAB, Lets-Plot for Kotlin and gadfly for Julia. Projects inspired by elements of Wickham's grammar include Vega-Lite which specifies plots in JSON and uses a JavaScript engine. Implementations for Python include Vega-Altair (built on top of Vega-Lite). == Vega-Lite: A Grammar of Interactive Graphics == === Theory === Vega-Lite combines ideas from Wilkinson's Grammar of Graphics and Wickham's Layered Grammar of Graphics with a composition algebra for layered and multi-view displays with a grammar of interaction. The Vega-Lite specification is instantiated in JSON and rendered by the lower-level Vega. The graphical grammar implemented by Vega-Lite is composed of the following: Unit: consists of data, transforms, mark-type and encoding Data: relational table consisting of records (rows) and named attributes (columns) Transforms: data transformations Mark-type: geometric object for visual encoding Encodings: mapping of data attributes to visual marks properties where each encoding consists of: Channel: e.g. colour, shape, size, or text Field: data attribute Data-type: e.g. nominal, ordinal, quantitative, or temporal Value: use a literal instead of a data-type Functions: e.g. binning, aggregation, and sorting Scale: maps from data domain to visual range Guide: axis or legend for visualising scale Composite Views: compose views from multiple unit specifications with operators: Layer: charts plotted on top of each other Hconcat/Vconcat: place views side-by-side Facet: subset data to produce a trellis plot Repeat: multiple plots similar to facet but with full data replication in each cell Interaction: selections identify the set of points a user is interested in manipulating, with components: Selection: get the minimal number of backing points Name: reference Type: how many backing values are stored Predicate: determine the set of selected points e.g. single, list, interval Domain|Range: store data domain or visual range Event: e.g. mouseover, mousedown, mouseup, Init: initialise with specific backing points Transforms: e.g. project, toggle, translate, zoom, and nearest Resolve: resolve selections to union or intersect ==== Implementations ==== Whilst Vega-Lite is the sole implementation of this graphics grammar specification with compilation to Vega, other implementations do create JSON files which can be interpreted by Vega-Lite. == Related projects == Ggplot2 is an R package for plotting Tableau Software (originally known as Polaris) is a commercial software built using the Grammar of Graphics nViZn built by Wilkinson. SYSTAT (statistics package) built by Wilkinson ggpy, ggplot for Python, but has not been updated since 20 November 2016 plotnine started as an effort to improve the scalability of ggplot for Python and is largely compatible with ggplot2 syntax. Plotly - Interactive, online ggplot2 graphs gramm, a plotting class for MATLAB inspired by ggplot2 gadfly, a system for plotting and visualization written in Julia, based largely on ggplot2 Chart::GGPlot - ggplot2 port in Perl, but has not been updated since 16 March 2023 The Lets-Plot for Python library includes a native backend and a Python API, which was mostly based on the ggplot2 package. Lets-Plot Kotlin API is an open-source plotting library for statistical data implemented using the Kotlin programming language, and is built on the principles of layered graphics first described in the Leland Wilkinson's work The Grammar of Graphics. ggplotnim, plotting library using the Nim programming language inspired by ggplot2. Vega and Vega-Lite are plotting libraries that use JSON to specify plots. Vega-Altair, a Python library built on top of Vega-Lite chart-parts - React-friendly Grammar of Graphics, but has not been updated since 10 Dec 2021 g2 - a JavaScript library

    Read more →
  • Multiple encryption

    Multiple encryption

    Multiple encryption is the process of encrypting an already encrypted message one or more times, either using the same or a different algorithm. It is also known as cascade encryption, cascade ciphering, cipher stacking, multiple encryption, and superencipherment. Superencryption refers to the outer-level encryption of a multiple encryption. Some cryptographers, like Matthew Green of Johns Hopkins University, say multiple encryption addresses a problem that mostly doesn't exist: Modern ciphers rarely get broken... You’re far more likely to get hit by malware or an implementation bug than you are to suffer a catastrophic attack on Advanced Encryption Standard (AES). However, from the previous quote an argument for multiple encryption can be made, namely poor implementation. Using two different cryptomodules and keying processes from two different vendors requires both vendors' wares to be compromised for security to fail completely. == Independent keys == Picking any two ciphers, if the key used is the same for both, the second cipher could possibly undo the first cipher, partly or entirely. This is true of ciphers where the decryption process is exactly the same as the encryption process (a reciprocal cipher) – the second cipher would completely undo the first. If an attacker were to recover the key through cryptanalysis of the first encryption layer, the attacker could possibly decrypt all the remaining layers, assuming the same key is used for all layers. To prevent that risk, one can use keys that are statistically independent for each layer (e.g. independent RNGs). Ideally each key should have separate and different generation, sharing, and management processes. == Independent Initialization Vectors == For en/decryption processes that require sharing an Initialization Vector (IV) / nonce these are typically, openly shared or made known to the recipient (and everyone else). Its good security policy never to provide the same data in both plaintext and ciphertext when using the same key and IV. Therefore, its recommended (although at this moment without specific evidence) to use separate IVs for each layer of encryption. == Importance of the first layer == With the exception of the one-time pad, no cipher has been theoretically proven to be unbreakable. Furthermore, some recurring properties may be found in the ciphertexts generated by the first cipher. Since those ciphertexts are the plaintexts used by the second cipher, the second cipher may be rendered vulnerable to attacks based on known plaintext properties (see references below). This is the case when the first layer is a program P that always adds the same string S of characters at the beginning (or end) of all ciphertexts (commonly known as a magic number). When found in a file, the string S allows an operating system to know that the program P has to be launched in order to decrypt the file. This string should be removed before adding a second layer. To prevent this kind of attack, one can use the method provided by Bruce Schneier: Generate a random pad R of the same size as the plaintext. Encrypt R using the first cipher and key. XOR the plaintext with the pad, then encrypt the result using the second cipher and a different (!) key. Concatenate both ciphertexts in order to build the final ciphertext. A cryptanalyst must break both ciphers to get any information. This will, however, have the drawback of making the ciphertext twice as long as the original plaintext. Note, however, that a weak first cipher may merely make a second cipher that is vulnerable to a chosen plaintext attack also vulnerable to a known plaintext attack. However, a block cipher must not be vulnerable to a chosen plaintext attack to be considered secure. Therefore, the second cipher described above is not secure under that definition, either. Consequently, both ciphers still need to be broken. The attack illustrates why strong assumptions are made about secure block ciphers and ciphers that are even partially broken should never be used. == The Rule of Two == The Rule of Two is a data security principle from the NSA's Commercial Solutions for Classified Program (CSfC). It specifies two completely independent layers of cryptography to protect data. For example, data could be protected by both hardware encryption at its lowest level and software encryption at the application layer. It could mean using two FIPS-validated software cryptomodules from different vendors to en/decrypt data. The importance of vendor and/or model diversity between the layers of components centers around removing the possibility that the manufacturers or models will share a vulnerability. This way if one components is compromised there is still an entire layer of encryption protecting the information at rest or in transit. The CSfC Program offers solutions to achieve diversity in two ways. "The first is to implement each layer using components produced by different manufacturers. The second is to use components from the same manufacturer, where that manufacturer has provided NSA with sufficient evidence that the implementations of the two components are independent of one another." The principle is practiced in the NSA's secure mobile phone called Fishbowl. The phones use two layers of encryption protocols, IPsec and Secure Real-time Transport Protocol (SRTP), to protect voice communications. The Samsung Galaxy S9 Tactical Edition is also an approved CSfC Component.

    Read more →
  • Instagram face

    Instagram face

    Instagram face is a beauty standard based on the filters and influencers popular on Instagram. == Overview == An "Instagram face" has catlike eyes, long lashes, a small nose, high cheekbones, full lips, and a blank expression. Digital filters manipulate photographs and video to create an idealized image that, according to critics, has resulted in an unrealistic and homogeneous beauty standard. According to Jia Tolentino, the face is "distinctly white but ambiguously ethnic". The face has been described as a racial composite of different peoples. In 2024, cosmetic surgeon Paul Banwell said, "People used to come to see me asking to look like a particular celebrity, but many patients come to me now wanting to look like the filtered version of themselves." While based on digital filters, the look is achieved in person using heavy applications of makeup or cosmetic surgery. Plastic surgery, Botox injections, and injectable filler have significantly increased in popularity since the rise of digital filters. Influencers market makeup products designed to recreate the look. == History == The growth of reality television series and social media throughout the 2010s has influenced the popularity of Instagram face. In 2019, The New Yorker referred to this phenomenon as "Instagram Face," identifying Kim Kardashian as its "patient zero." Similarly, her younger sister Kylie Jenner significantly impacted the trend with her 2015 lip filler confession, which acted as a catalyst, introducing Juvéderm to a new generation. Sirin Kale of Vice News has described Jenner as "at the vanguard of an aesthetic that’s swept through British towns and cities," while also pointing towards other celebrities such as Iggy Azalea and Farrah Abraham. In 2018, Americans underwent 7 million neurotoxin injections and 2.5 million filler injections and spent $16.5 billion on cosmetic surgery. 92% of the latter was performed on women. Botox usage has also been on the rise. == Criticism == In her 2021 book The Selfie, Temporality, and Contemporary Photography, Claire Raymond of Princeton University criticised "Instagram faces" for erasing "heritable quirks and lived history; it erases what makes the human face so compelling, whether conventionally beautiful or not," while also arguing that the procedures used to create Instagram faces "numb and freeze the face and skin, rendering less mobile the lips, the eyes, and the neck. Numbness is the central feature of the experience for the woman who gets Instagram face through cosmetic procedures. Others may see her more, but she feels less and less." == Influence on popular culture == The increasing popularity of cosmetic surgeries towards a homogeneous ideal has resulted in the emergence of the "goopcore" sub-genre of body horror. The sub-genre combines graphic violence with body modifications from the beauty industry. Allie Rowbottom's goopcore novel Aesthetica centers around an influencer attempting to undo years of plastic surgery with a new experimental procedure.

    Read more →
  • Computer-aided software engineering

    Computer-aided software engineering

    Computer-aided software engineering (CASE) is a domain of software tools used to design and implement applications. CASE tools are similar to and are partly inspired by computer-aided design (CAD) tools used for designing hardware products. CASE tools are intended to help develop high-quality, defect-free, and maintainable software. CASE software was often associated with methods for the development of information systems together with automated tools that could be used in the software development process. == History == The Information System Design and Optimization System (ISDOS) project, started in 1968 at the University of Michigan, initiated a great deal of interest in the whole concept of using computer systems to help analysts in the very difficult process of analysing requirements and developing systems. Several papers by Daniel Teichroew fired a whole generation of enthusiasts with the potential of automated systems development. His Problem Statement Language / Problem Statement Analyzer (PSL/PSA) tool was a CASE tool although it predated the term. Another major thread emerged as a logical extension to the data dictionary of a database. By extending the range of metadata held, the attributes of an application could be held within a dictionary and used at runtime. This "active dictionary" became the precursor to the more modern model-driven engineering capability. However, the active dictionary did not provide a graphical representation of any of the metadata. It was the linking of the concept of a dictionary holding analysts' metadata, as derived from the use of an integrated set of techniques, together with the graphical representation of such data that gave rise to the earlier versions of CASE. The next entrant into the market was Excelerator from Index Technology in Cambridge, Mass. While DesignAid ran on Convergent Technologies and later Burroughs Ngen networked microcomputers, Index launched Excelerator on the IBM PC/AT platform. While, at the time of launch, and for several years, the IBM platform did not support networking or a centralized database as did the Convergent Technologies or Burroughs machines, the allure of IBM was strong, and Excelerator came to prominence. Hot on the heels of Excelerator were a rash of offerings from companies such as Knowledgeware (James Martin, Fran Tarkenton and Don Addington), Texas Instrument's CA Gen and Andersen Consulting's FOUNDATION toolset (DESIGN/1, INSTALL/1, FCP). CASE tools were at their peak in the early 1990s. According to the PC Magazine of January 1990, over 100 companies were offering nearly 200 different CASE tools. At the time IBM had proposed AD/Cycle, which was an alliance of software vendors centered on IBM's Software repository using IBM DB2 in mainframe and OS/2: The application development tools can be from several sources: from IBM, from vendors, and from the customers themselves. IBM has entered into relationships with Bachman Information Systems, Index Technology Corporation, and Knowledgeware wherein selected products from these vendors will be marketed through an IBM complementary marketing program to provide offerings that will help to achieve complete life-cycle coverage. With the decline of the mainframe, AD/Cycle and the Big CASE tools died off, opening the market for the mainstream CASE tools of today. Many of the leaders of the CASE market of the early 1990s ended up being purchased by Computer Associates, including IEW, IEF, ADW, Cayenne, and Learmonth & Burchett Management Systems (LBMS). The other trend that led to the evolution of CASE tools was the rise of object-oriented methods and tools. Most of the various tool vendors added some support for object-oriented methods and tools. In addition new products arose that were designed from the bottom up to support the object-oriented approach. Andersen developed its project Eagle as an alternative to Foundation. Several of the thought leaders in object-oriented development each developed their own methodology and CASE tool set: Jacobson, Rumbaugh, Booch, etc. Eventually, these diverse tool sets and methods were consolidated via standards led by the Object Management Group (OMG). The OMG's Unified Modelling Language (UML) is currently widely accepted as the industry standard for object-oriented modeling. == CASE software == === Tools === CASE tools support specific tasks in the software development life-cycle. They can be divided into the following categories: Business and analysis modeling: Graphical modeling tools. E.g., E/R modeling, object modeling, etc. Development: Design and construction phases of the life-cycle. Debugging environments. E.g., IISE LKO. Verification and validation: Analyze code and specifications for correctness, performance, etc. Configuration management: Control the check-in and check-out of repository objects and files. E.g., SCCS, IISE. Metrics and measurement: Analyze code for complexity, modularity (e.g., no "go to's"), performance, etc. Project management: Manage project plans, task assignments, scheduling. Another common way to distinguish CASE tools is the distinction between Upper CASE and Lower CASE. Upper CASE Tools support business and analysis modeling. They support traditional diagrammatic languages such as ER diagrams, Data flow diagram, Structure charts, Decision Trees, Decision tables, etc. Lower CASE Tools support development activities, such as physical design, debugging, construction, testing, component integration, maintenance, and reverse engineering. All other activities span the entire life-cycle and apply equally to upper and lower CASE. === Workbenches === Workbenches integrate two or more CASE tools and support specific software-process activities. Hence they achieve: A homogeneous and consistent interface (presentation integration) Seamless integration of tools and toolchains (control and data integration) An example workbench is Microsoft's Visual Basic programming environment. It incorporates several development tools: a GUI builder, a smart code editor, debugger, etc. Most commercial CASE products tended to be such workbenches that seamlessly integrated two or more tools. Workbenches also can be classified in the same manner as tools; as focusing on Analysis, Development, Verification, etc. as well as being focused on the upper case, lower case, or processes such as configuration management that span the complete life-cycle. === Environments === An environment is a collection of CASE tools or workbenches that attempts to support the complete software process. This contrasts with tools that focus on one specific task or a specific part of the life-cycle. CASE environments are classified by Fuggetta as follows: Toolkits: Loosely coupled collections of tools. These typically build on operating system workbenches such as the Unix Programmer's Workbench or the VMS VAX set. They typically perform integration via piping or some other basic mechanism to share data and pass control. The strength of easy integration is also one of the drawbacks. Simple passing of parameters via technologies such as shell scripting can't provide the kind of sophisticated integration that a common repository database can. Fourth generation: These environments are also known as 4GL standing for fourth generation language environments due to the fact that the early environments were designed around specific languages such as Visual Basic. They were the first environments to provide deep integration of multiple tools. Typically these environments were focused on specific types of applications. For example, user-interface driven applications that did standard atomic transactions to a relational database. Examples are Informix 4GL, and Focus. Language-centered: Environments based on a single often object-oriented language such as the Symbolics Lisp Genera environment or VisualWorks Smalltalk from Parcplace. In these environments all the operating system resources were objects in the object-oriented language. This provides powerful debugging and graphical opportunities but the code developed is mostly limited to the specific language. For this reason, these environments were mostly a niche within CASE. Their use was mostly for prototyping and R&D projects. A common core idea for these environments was the model–view–controller user interface that facilitated keeping multiple presentations of the same design consistent with the underlying model. The MVC architecture was adopted by the other types of CASE environments as well as many of the applications that were built with them. Integrated: These environments are an example of what most IT people tend to think of first when they think of CASE. Environments such as IBM's AD/Cycle, Andersen Consulting's FOUNDATION, the ICL CADES system, and DEC Cohesion. These environments attempt to cover the complete life-cycle from analysis to maintenance and provide an integrated database repository for storing all artifacts of the software pr

    Read more →
  • Instance-based learning

    Instance-based learning

    In machine learning, instance-based learning (sometimes called memory-based learning) is a family of learning algorithms that, instead of performing explicit generalization, compare new problem instances with instances seen in training, which have been stored in memory. Because computation is postponed until a new instance is observed, these algorithms are sometimes referred to as "lazy." It is called instance-based because it constructs hypotheses directly from the training instances themselves. This means that the hypothesis complexity can grow with the data: in the worst case, a hypothesis is a list of n training items and the computational complexity of classifying a single new instance is O(n). One advantage that instance-based learning has over other methods of machine learning is its ability to adapt its model to previously unseen data. Instance-based learners may simply store a new instance or throw an old instance away. Examples of instance-based learning algorithms are the k-nearest neighbors algorithm, kernel machines and RBF networks. These store (a subset of) their training set; when predicting a value/class for a new instance, they compute distances or similarities between this instance and the training instances to make a decision. To battle the memory complexity of storing all training instances, as well as the risk of overfitting to noise in the training set, instance reduction algorithms have been proposed.

    Read more →
  • Cut, copy, and paste

    Cut, copy, and paste

    Cut, copy, and paste are essential commands of modern human–computer interaction and user interface design. They offer an interprocess communication technique for transferring data through a computer's user interface. The cut command removes the selected data from its original position, and the copy command creates a duplicate; in both cases the selected data is kept in temporary storage called the clipboard. Clipboard data is later inserted wherever a paste command is issued. The data remains available to any application supporting the feature, thus allowing easy data transfer between applications. The command names are a (skeuomorphic) interface metaphor based on the physical procedure used in manuscript print editing to create a page layout, like with paper. The commands were pioneered into computing by Xerox PARC in 1974, popularized by Apple Computer in the 1983 Lisa workstation and the 1984 Macintosh computer, and in a few home computer applications such as the 1984 word processor Cut & Paste. This interaction technique has close associations with related techniques in graphical user interfaces (GUIs) that use pointing devices such as a computer mouse (by drag and drop, for example). Typically, clipboard support is provided by an operating system as part of its GUI and widget toolkit. The capability to replicate information with ease, changing it between contexts and applications, involves privacy concerns because of the risks of disclosure when handling sensitive information. Terms like cloning, copy forward, carry forward, or re-use refer to the dissemination of such information through documents, and may be subject to regulation by administrative bodies. == History == === Origins === The term "cut and paste" comes from the traditional practice in manuscript editing, whereby people cut paragraphs from a page with scissors and paste them onto another page. This practice remained standard into the 1980s. Stationery stores sold "editing scissors" with blades long enough to cut an 8½"-wide page. The advent of photocopiers made the practice easier and more flexible. The act of copying or transferring text from one part of a computer-based document ("buffer") to a different location within the same or different computer-based document was a part of the earliest on-line computer editors. As soon as computer data entry moved from punch-cards to online files (in the mid/late 1960s) there were "commands" for accomplishing this operation. This mechanism was often used to transfer frequently-used commands or text snippets from additional buffers into the document, as was the case with the QED text editor. === Early methods === The earliest editors (designed for teleprinter terminals) provided keyboard commands to delineate a contiguous region of text, then delete or move it. Since moving a region of text requires first removing it from its initial location and then inserting it into its new location, various schemes had to be invented to allow for this multi-step process to be specified by the user. Often this was done with a "move" command, but some text editors required that the text be first put into some temporary location for later retrieval/placement. In 1983, the Apple Lisa became the first text editing system to call that temporary location "the clipboard". Earlier control schemes such as NLS used a verb—object command structure, where the command name was provided first and the object to be copied or moved was second. The inversion from verb—object to object—verb on which copy and paste are based, where the user selects the object to be operated before initiating the operation, was an innovation crucial for the success of the desktop metaphor as it allowed copy and move operations based on direct manipulation. === Popularization === Inspired by early line and character editors, such as Pentti Kanerva's TV-Edit, that broke a move or copy operation into two steps—between which the user could invoke a preparatory action such as navigation—Lawrence G. "Larry" Tesler proposed the names "cut" and "copy" for the first step and "paste" for the second step. Beginning in 1974, he and colleagues at Xerox PARC implemented several text editors that used cut/copy-and-paste commands to move and copy text. Apple Computer popularized this paradigm with its Lisa (1983) and Macintosh (1984) operating systems and applications. The functions were mapped to key combinations using the ⌘ Command key as a special modifier, which is held down while also pressing X for cut, C for copy, or V for paste. These few keyboard shortcuts allow the user to perform all the basic editing operations, and the keys are clustered at the left end of the bottom row of the standard QWERTY keyboard. These are the standard shortcuts: Control-Z (or ⌘ Command+Z) to undo Control-X (or ⌘ Command+X) to cut Control-C (or ⌘ Command+C) to copy Control-V (or ⌘ Command+V) to paste The IBM Common User Access (CUA) standard also uses combinations of the Insert, Del, Shift and Control keys. Early versions of Windows used the IBM standard. Microsoft later also adopted the Apple key combinations with the introduction of Windows, using the control key as modifier key. Similar patterns of key combinations, later borrowed by others, are widely available in most GUI applications. The original cut, copy, and paste workflow, as implemented at PARC, utilizes a unique workflow: With two windows on the same screen, the user could use the mouse to pick a point at which to make an insertion in one window (or a segment of text to replace). Then, by holding shift and selecting the copy source elsewhere on the same screen, the copy would be made as soon as the shift was released. Similarly, holding shift and control would copy and cut (delete) the source. This workflow requires many fewer keystrokes/mouse clicks than the current multi-step workflows, and did not require an explicit copy buffer. It was dropped, one presumes, because the original Apple and IBM GUIs were not high enough density to permit multiple windows, as were the PARC machines, and so multiple simultaneous windows were rarely used. == Cut and paste == Computer-based editing can involve very frequent use of cut-and-paste operations. Most software-suppliers provide several methods for performing such tasks, and this can involve (for example) key combinations, pulldown menus, pop-up menus, or toolbar buttons. The user selects or "highlights" the text or file for moving by some method, typically by dragging over the text or file name with the pointing-device or holding down the Shift key while using the arrow keys to move the text cursor. The user performs a "cut" operation via key combination Ctrl+x (⌘+x for Macintosh users), menu, or other means. Visibly, "cut" text immediately disappears from its location. "Cut" files typically change color to indicate that they will be moved. Conceptually, the text has now moved to a location often called the clipboard. The clipboard typically remains invisible. On most systems only one clipboard location exists, hence another cut or copy operation overwrites the previously stored information. Many UNIX text-editors provide multiple clipboard entries, as do some Macintosh programs such as Clipboard Master, and Windows clipboard-manager programs such as the one in Microsoft Office. The user selects a location for insertion by some method, typically by clicking at the desired insertion point. A paste operation takes place which visibly inserts the clipboard text at the insertion point. (The paste operation does not typically destroy the clipboard text: it remains available in the clipboard and the user can insert additional copies at other points). Whereas cut-and-paste often takes place with a mouse-equivalent in Windows-like GUI environments, it may also occur entirely from the keyboard, especially in UNIX text editors, such as Pico or vi. Cutting and pasting without a mouse can involve a selection (for which Ctrl+x is pressed in most graphical systems) or the entire current line, but it may also involve text after the cursor until the end of the line and other more sophisticated operations. The clipboard usually stays invisible, because the operations of cutting and pasting, while actually independent, usually take place in quick succession, and the user (usually) needs no assistance in understanding the operation or maintaining mental context. Some application programs provide a means of viewing, or sometimes even editing, the data on the clipboard. == Copy and paste == The term "copy-and-paste" refers to the popular, simple method of reproducing text or other data from a source to a destination. It differs from cut and paste in that the original source text or data does not get deleted or removed. The popularity of this method stems from its simplicity and the ease with which users can move data between various applications visually – without resorting to permanent storage. Use in healthcare do

    Read more →
  • Batch cryptography

    Batch cryptography

    Batch cryptography is a field of cryptology focused on the design of cryptographic protocols that perform operations—such as encryption, decryption, key exchange, and authentication—on multiple inputs simultaneously, rather than processing each input individually. Batching cryptographic operations can significantly reduce the marginal cost of handling individual inputs—a principle that was first introduced by Amos Fiat in 1989.

    Read more →
  • Plaintext

    Plaintext

    In cryptography, plaintext usually means unencrypted information pending input into cryptographic algorithms, usually encryption algorithms. This usually refers to data that is transmitted or stored unencrypted. == Overview == With the advent of computing, the term plaintext expanded beyond human-readable documents to mean any data, including binary files, in a form that can be viewed or used without requiring a key or other decryption device. Information—a message, document, file, etc.—if to be communicated or stored in an unencrypted form is referred to as plaintext. Plaintext is used as input to an encryption algorithm; the output is usually termed ciphertext, particularly when the algorithm is a cipher. Codetext is less often used, and almost always only when the algorithm involved is actually a code. Some systems use multiple layers of encryption, with the output of one encryption algorithm becoming "plaintext" input for the next. == Secure handling == Insecure handling of plaintext can introduce weaknesses into a cryptosystem by letting an attacker bypass the cryptography altogether. Plaintext is vulnerable in use and in storage, whether in electronic or paper format. Physical security means the securing of information and its storage media from physical, attack—for instance by someone entering a building to access papers, storage media, or computers. Discarded material, if not disposed of securely, may be a security risk. Even shredded documents and erased magnetic media might be reconstructed with sufficient effort. If plaintext is stored in a computer file, the storage media, the computer and its components, and all backups must be secure. Sensitive data is sometimes processed on computers whose mass storage is removable, in which case physical security of the removed disk is vital. In the case of securing a computer, useful (as opposed to handwaving) security must be physical (e.g., against burglary, brazen removal under cover of supposed repair, installation of covert monitoring devices, etc.), as well as virtual (e.g., operating system modification, illicit network access, Trojan programs). Wide availability of keydrives, which can plug into most modern computers and store large quantities of data, poses another severe security headache. A spy (perhaps posing as a cleaning person) could easily conceal one, and even swallow it if necessary. Discarded computers, disk drives and media are also a potential source of plaintexts. Most operating systems do not actually erase anything— they simply mark the disk space occupied by a deleted file as 'available for use', and remove its entry from the file system directory. The information in a file deleted in this way remains fully present until overwritten at some later time when the operating system reuses the disk space. With even low-end computers commonly sold with many gigabytes of disk space and rising monthly, this 'later time' may be months later, or never. Even overwriting the portion of a disk surface occupied by a deleted file is insufficient in many cases. Peter Gutmann of the University of Auckland wrote a celebrated 1996 paper on the recovery of overwritten information from magnetic disks; areal storage densities have gotten much higher since then, so this sort of recovery is likely to be more difficult than it was when Gutmann wrote. Modern hard drives automatically remap failing sectors, moving data to good sectors. This process makes information on those failing, excluded sectors invisible to the file system and normal applications. Special software, however, can still extract information from them. Some government agencies (e.g., US NSA) require that personnel physically pulverize discarded disk drives and, in some cases, treat them with chemical corrosives. This practice is not widespread outside government, however. Garfinkel and Shelat (2003) analyzed 158 second-hand hard drives they acquired at garage sales and the like, and found that less than 10% had been sufficiently sanitized. The others contained a wide variety of readable personal and confidential information. See data remanence. Physical loss is a serious problem. The US State Department, Department of Defense, and the British Secret Service have all had laptops with secret information, including in plaintext, lost or stolen. Appropriate disk encryption techniques can safeguard data on misappropriated computers or media. On occasion, even when data on host systems is encrypted, media that personnel use to transfer data between systems is plaintext because of poorly designed data policy. For example, in October 2007, HM Revenue and Customs lost CDs that contained the unencrypted records of 25 million child benefit recipients in the United Kingdom. Modern cryptographic systems resist known plaintext or even chosen plaintext attacks, and so may not be entirely compromised when plaintext is lost or stolen. Older systems resisted the effects of plaintext data loss on security with less effective techniques—such as padding and Russian copulation to obscure information in plaintext that could be easily guessed.

    Read more →
  • Empirical risk minimization

    Empirical risk minimization

    In statistical learning theory, the principle of empirical risk minimization defines a family of learning algorithms based on evaluating performance over a known and fixed dataset. The core idea is based on an application of the law of large numbers; more specifically, we cannot know exactly how well a predictive algorithm will work in practice (i.e. the "true risk") because we do not know the true distribution of the data, but we can instead estimate and optimize the performance of the algorithm on a known set of training data. The performance over the known set of training data is referred to as the "empirical risk". == Background == The following situation is a general setting of many supervised learning problems. There are two spaces of objects X {\displaystyle X} and Y {\displaystyle Y} and we would like to learn a function h : X → Y {\displaystyle \ h:X\to Y} (often called hypothesis) which outputs an object y ∈ Y {\displaystyle y\in Y} , given x ∈ X {\displaystyle x\in X} . To do so, there is a training set of n {\displaystyle n} examples ( x 1 , y 1 ) , … , ( x n , y n ) {\displaystyle \ (x_{1},y_{1}),\ldots ,(x_{n},y_{n})} where x i ∈ X {\displaystyle x_{i}\in X} is an input and y i ∈ Y {\displaystyle y_{i}\in Y} is the corresponding response that is desired from h ( x i ) {\displaystyle h(x_{i})} . To put it more formally, assuming that there is a joint probability distribution P ( x , y ) {\displaystyle P(x,y)} over X {\displaystyle X} and Y {\displaystyle Y} , and that the training set consists of n {\displaystyle n} instances ( x 1 , y 1 ) , … , ( x n , y n ) {\displaystyle \ (x_{1},y_{1}),\ldots ,(x_{n},y_{n})} drawn i.i.d. from P ( x , y ) {\displaystyle P(x,y)} . The assumption of a joint probability distribution allows for the modelling of uncertainty in predictions (e.g. from noise in data) because y {\displaystyle y} is not a deterministic function of x {\displaystyle x} , but rather a random variable with conditional distribution P ( y | x ) {\displaystyle P(y|x)} for a fixed x {\displaystyle x} . It is also assumed that there is a non-negative real-valued loss function L ( y ^ , y ) {\displaystyle L({\hat {y}},y)} which measures how different the prediction y ^ {\displaystyle {\hat {y}}} of a hypothesis is from the true outcome y {\displaystyle y} . For classification tasks, these loss functions can be scoring rules. The risk associated with hypothesis h ( x ) {\displaystyle h(x)} is then defined as the expectation of the loss function: R ( h ) = E [ L ( h ( x ) , y ) ] = ∫ L ( h ( x ) , y ) d P ( x , y ) . {\displaystyle R(h)=\mathbf {E} [L(h(x),y)]=\int L(h(x),y)\,dP(x,y).} A loss function commonly used in theory is the 0-1 loss function: L ( y ^ , y ) = { 1 if y ^ ≠ y 0 if y ^ = y {\displaystyle L({\hat {y}},y)={\begin{cases}1&{\mbox{ if }}\quad {\hat {y}}\neq y\\0&{\mbox{ if }}\quad {\hat {y}}=y\end{cases}}} . The ultimate goal of a learning algorithm is to find a hypothesis h ∗ {\displaystyle h^{}} among a fixed class of functions H {\displaystyle {\mathcal {H}}} for which the risk R ( h ) {\displaystyle R(h)} is minimal: h ∗ = a r g m i n h ∈ H R ( h ) . {\displaystyle h^{}={\underset {h\in {\mathcal {H}}}{\operatorname {arg\,min} }}\,{R(h)}.} For classification problems, the Bayes classifier is defined to be the classifier minimizing the risk defined with the 0–1 loss function. == Formal definition == In general, the risk R ( h ) {\displaystyle R(h)} cannot be computed because the distribution P ( x , y ) {\displaystyle P(x,y)} is unknown to the learning algorithm. However, given a sample of iid training data points, we can compute an estimate, called the empirical risk, by computing the average of the loss function over the training set; more formally, computing the expectation with respect to the empirical measure: R emp ( h ) = 1 n ∑ i = 1 n L ( h ( x i ) , y i ) . {\displaystyle \!R_{\text{emp}}(h)={\frac {1}{n}}\sum _{i=1}^{n}L(h(x_{i}),y_{i}).} The empirical risk minimization principle states that the learning algorithm should choose a hypothesis h ^ {\displaystyle {\hat {h}}} which minimizes the empirical risk over the hypothesis class H {\displaystyle {\mathcal {H}}} : h ^ = a r g m i n h ∈ H R emp ( h ) . {\displaystyle {\hat {h}}={\underset {h\in {\mathcal {H}}}{\operatorname {arg\,min} }}\,R_{\text{emp}}(h).} Thus, the learning algorithm defined by the empirical risk minimization principle consists in solving the above optimization problem. == Properties == Guarantees for the performance of empirical risk minimization depend strongly on the function class selected as well as the distributional assumptions made. In general, distribution-free methods are too coarse, and do not lead to practical bounds. However, they are still useful in deriving asymptotic properties of learning algorithms, such as consistency. In particular, distribution-free bounds on the performance of empirical risk minimization given a fixed function class can be derived using bounds on the VC complexity of the function class. For simplicity, considering the case of binary classification tasks, it is possible to bound the probability of the selected classifier, ϕ n {\displaystyle \phi _{n}} being much worse than the best possible classifier ϕ ∗ {\displaystyle \phi ^{}} . Consider the risk L {\displaystyle L} defined over the hypothesis class C {\displaystyle {\mathcal {C}}} with growth function S ( C , n ) {\displaystyle {\mathcal {S}}({\mathcal {C}},n)} given a dataset of size n {\displaystyle n} . Then, for every ϵ > 0 {\displaystyle \epsilon >0} : P ( L ( ϕ n ) − L ( ϕ ∗ ) > ϵ ) ≤ 8 S ( C , n ) exp ⁡ { − n ϵ 2 / 32 } {\displaystyle \mathbb {P} \left(L(\phi _{n})-L(\phi ^{})>\epsilon \right)\leq {\mathcal {8}}S({\mathcal {C}},n)\exp\{-n\epsilon ^{2}/32\}} Similar results hold for regression tasks. These results are often based on uniform laws of large numbers, which control the deviation of the empirical risk from the true risk, uniformly over the hypothesis class. === Impossibility results === It is also possible to show lower bounds on algorithm performance if no distributional assumptions are made. This is sometimes referred to as the No free lunch theorem. Even though a specific learning algorithm may provide the asymptotically optimal performance for any distribution, the finite sample performance is always poor for at least one data distribution. This means that no classifier can improve on the error for a given sample size for all distributions. Specifically, let ϵ > 0 {\displaystyle \epsilon >0} and consider a sample size n {\displaystyle n} and classification rule ϕ n {\displaystyle \phi _{n}} , there exists a distribution of ( X , Y ) {\displaystyle (X,Y)} with risk L ∗ = 0 {\displaystyle L^{}=0} (meaning that perfect prediction is possible) such that: E L n ≥ 1 / 2 − ϵ . {\displaystyle \mathbb {E} L_{n}\geq 1/2-\epsilon .} It is further possible to show that the convergence rate of a learning algorithm is poor for some distributions. Specifically, given a sequence of decreasing positive numbers a i {\displaystyle a_{i}} converging to zero, it is possible to find a distribution such that: E L n ≥ a i {\displaystyle \mathbb {E} L_{n}\geq a_{i}} for all n {\displaystyle n} . This result shows that universally good classification rules do not exist, in the sense that the rule must be low quality for at least one distribution. === Computational complexity === Empirical risk minimization for a classification problem with a 0-1 loss function is known to be an NP-hard problem even for a relatively simple class of functions such as linear classifiers. Nevertheless, it can be solved efficiently when the minimal empirical risk is zero, i.e., data is linearly separable. In practice, machine learning algorithms cope with this issue either by employing a convex approximation to the 0–1 loss function (like hinge loss for SVM), which is easier to optimize, or by imposing assumptions on the distribution P ( x , y ) {\displaystyle P(x,y)} (and thus stop being agnostic learning algorithms to which the above result applies). In the case of convexification, Zhang's lemma majors the excess risk of the original problem using the excess risk of the convexified problem. Minimizing the latter using convex optimization also allow to control the former. == Tilted empirical risk minimization == Tilted empirical risk minimization is a machine learning technique used to modify standard loss functions like squared error, by introducing a tilt parameter. This parameter dynamically adjusts the weight of data points during training, allowing the algorithm to focus on specific regions or characteristics of the data distribution. Tilted empirical risk minimization is particularly useful in scenarios with imbalanced data or when there is a need to emphasize errors in certain parts of the prediction space.

    Read more →
  • Knapsack cryptosystems

    Knapsack cryptosystems

    Knapsack cryptosystems are cryptosystems whose security is based on the hardness of solving the knapsack problem. They remain quite unpopular because simple versions of these algorithms have been broken for several decades. However, that type of cryptosystem is a good candidate for post-quantum cryptography. The most famous knapsack cryptosystem is the Merkle-Hellman Public Key Cryptosystem, one of the first public key cryptosystems, published the same year as the RSA cryptosystem. However, this system has been broken by several attacks: one from Shamir, one by Adleman, and the low density attack. However, there exist modern knapsack cryptosystems that are considered secure so far: among them is Nasako-Murakami 2006. Knapsack cryptosystems, when not subject to classical cryptoanalysis, are believed to be difficult even for quantum computers. That is not the case for systems that rely on factoring large integers, like RSA, or computing discrete logarithms, like ECDSA, problems solved in polynomial time with Shor's algorithm.

    Read more →
  • Data steward

    Data steward

    A data steward is an oversight or data governance role within an organization, and is responsible for ensuring the quality and fitness for purpose of the organization's data assets, including the metadata for those data assets. A data steward may share some responsibilities with a data custodian, such as the awareness, accessibility, release, appropriate use, security and management of data. A data steward would also participate in the development and implementation of data assets. A data steward may seek to improve the quality and fitness for purpose of other data assets their organization depends upon but is not responsible for. Data stewards have a specialist role that utilizes an organization's data governance processes, policies, guidelines and responsibilities for administering an organizations' entire data in compliance with policy and/or regulatory obligations (e.g., GDPR, HIPAA). The overall objective of a data steward is the data quality of the data assets, datasets, data records and data elements. This includes documenting metainformation for the data, such as definitions, related rules/governance, physical manifestation, and related data models (most of these properties being specific to an attribute/concept relationship), identifying owners/custodian's various responsibilities, relations insight pertaining to attribute quality, aiding with project requirement data facilitation and documentation of capture rules. Data stewards begin the stewarding process with the identification of the data assets and elements which they will steward, with the ultimate result being standards, controls and data entry. The steward works closely with business glossary standards analysts (for standards), with data architect/modelers (for standards), with DQ analysts (for controls) and with operations team members (good-quality data going in per business rules) while entering data. Data stewardship roles are common when organizations attempt to exchange data precisely and consistently between computer systems and to reuse data-related resources. Master data management often makes references to the need for data stewardship for its implementation to succeed. Data stewardship must have precise purpose, fit for purpose or fitness. == Data steward responsibilities == A data steward ensures that each assigned data element: Has clear and unambiguous data element definition Does not conflict with other data elements in the metadata registry (removes duplicates, overlap etc.) Has clear enumerated value definitions if it is of type Code Is still being used (remove unused data elements) Is being used consistently in various computer systems Is being used, fit for purpose = Data Fitness Has adequate documentation on appropriate usage and notes Documents the origin and sources of authority on each metadata element Is protected against unauthorised access or change Responsibilities of data stewards vary between different organisations and institutions. For example, at Delft University of Technology, data stewards are perceived as the first contact point for any questions related to research data. They also have subject-specific background allowing them to easily connect with researchers and to contextualise data management problems to take into account disciplinary practices. == Types of data stewards == Depending on the set of data stewardship responsibilities assigned to an individual, there are 4 types (or dimensions of responsibility) of data stewards typically found within an organization: Data object data steward - responsible for managing reference data and attributes of one business data entity Business data steward - responsible for managing critical data, both reference and transactional, created or used by one business function. The data steward may also serve as a liaison between the organization's data users and technical teams, helping to bridge the gap between business needs and technical requirements. They may also play a role in educating others within the organization about best practices for data management, and advocating for data-driven decision-making. Process data steward - responsible for managing data across one business process System data steward - responsible for managing data for at least one IT system == Benefits of data stewardship == Systematic data stewardship can foster: Faster analysis Consistent use of data management resources Easy mapping of data between computer systems and exchange documents Lower costs associated with migration to (for example) service-oriented architecture (SOA) Mitigation of data risk Better control of dangers associated with privacy, legal, errors, etc. Assignment of each data element to a person sometimes seems like an unimportant process. But multiple groups have found that users have greater trust and usage rates in systems where they can contact a person with questions on each data element. == Examples == Delft University of Technology (TU Delft) offers an example of data stewardship implementation at a research institution. In 2017 the Data Stewardship Project was initiated at TU Delft to address research data management needs in a disciplinary manner across the whole campus. Dedicated data stewards with subject-specific background were appointed at every TU Delft faculty to support researchers with data management questions and to act as a linking point with the other institutional support services. The project is coordinated centrally by TU Delft Library, and it has its own website, blog and a YouTube channel. The [1]EPA metadata registry furnishes an example of data stewardship. Note that each data element therein has a "POC" (point of contact). In 2023, ETH Zurich launched the Data Stewardship Network (DSN) to facilitate collaboration among employees engaged in data management, analysis, and code development across research groups. The DSN serves as a platform for networking and knowledge exchange, aiming to professionalize the role of data stewards who support research data management and reproducible workflows. Established by the team for Research Data Management and Digital Curation at the ETH Library, the DSN collaborates with Scientific IT Services to provide expertise in areas such as storage infrastructure and reproducible workflows. == Data stewardship applications == Information stewardship applications are business solutions used by business users acting in the role of information steward (interpreting and enforcing information governance policy, for example). These developing solutions represent, for the most part, an amalgam of a number of disparate, previously IT-centric tools already on the market, but are organized and presented in such a way that information stewards (a business role) can support the work of information policy enforcement as part of their normal, business-centric, day-to-day work in a range of use cases. The initial push for the formation of this new category of packaged software came from operational use cases — that is, use of business data in and between transactional and operational business applications. This is where most of the master data management efforts are undertaken in organizations. However, there is also now a faster-growing interest in the new data lake arena for more analytical use cases.

    Read more →
  • Sentiment analysis

    Sentiment analysis

    Sentiment analysis (also known as opinion mining) is the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is widely applied to voice of the customer materials such as reviews and survey responses, online and social media, and healthcare materials for applications that range from marketing to customer service to clinical medicine. With the rise of deep language models, such as RoBERTa, more difficult data domains can be analyzed, e.g., news texts where authors typically express their opinion/sentiment less explicitly. == Types == A basic task in sentiment analysis is classifying the polarity of a given text at the document, sentence, or feature/aspect level—whether the expressed opinion in a document, a sentence or an entity feature/aspect is positive, negative, or neutral. Advanced, "beyond polarity" sentiment classification looks, for instance, at emotional states such as enjoyment, anger, disgust, sadness, fear, and surprise. Precursors to sentimental analysis include the General Inquirer, which provided hints toward quantifying patterns in text and, separately, psychological research that examined a person's psychological state based on analysis of their verbal behavior. Subsequently, the method described in a patent by Volcani and Fogel, looked specifically at sentiment and identified individual words and phrases in text with respect to different emotional scales. A current system based on their work, called EffectCheck, presents synonyms that can be used to increase or decrease the level of evoked emotion in each scale. Many other subsequent efforts were less sophisticated, using a mere polar view of sentiment, from positive to negative, such as work by Turney, and Pang who applied different methods for detecting the polarity of product reviews and movie reviews respectively. This work is at the document level. One can also classify a document's polarity on a multi-way scale, which was attempted by Pang and Snyder among others: Pang and Lee expanded the basic task of classifying a movie review as either positive or negative to predict star ratings on either a 3- or a 4-star scale, while Snyder performed an in-depth analysis of restaurant reviews, predicting ratings for various aspects of the given restaurant, such as the food and atmosphere (on a five-star scale). First steps to bringing together various approaches—learning, lexical, knowledge-based, etc.—were taken in the 2004 AAAI Spring Symposium where linguists, computer scientists, and other interested researchers first aligned interests and proposed shared tasks and benchmark data sets for the systematic computational research on affect, appeal, subjectivity, and sentiment in text. Even though in most statistical classification methods, the neutral class is ignored under the assumption that neutral texts lie near the boundary of the binary classifier, several researchers suggest that, as in every polarity problem, three categories must be identified. Moreover, it can be proven that specific classifiers such as the Max Entropy and SVMs can benefit from the introduction of a neutral class and improve the overall accuracy of the classification. There are in principle two ways for operating with a neutral class. Either, the algorithm proceeds by first identifying the neutral language, filtering it out and then assessing the rest in terms of positive and negative sentiments, or it builds a three-way classification in one step. This second approach often involves estimating a probability distribution over all categories (e.g. naive Bayes classifiers as implemented by the NLTK). Whether and how to use a neutral class depends on the nature of the data: if the data is clearly clustered into neutral, negative and positive language, it makes sense to filter the neutral language out and focus on the polarity between positive and negative sentiments. If, in contrast, the data are mostly neutral with small deviations towards positive and negative affect, this strategy would make it harder to clearly distinguish between the two poles. A different method for determining sentiment is the use of a scaling system whereby words commonly associated with having a negative, neutral, or positive sentiment are given an associated number on a −10 to +10 scale (most negative up to most positive) or simply from 0 to a positive upper limit such as +4. This makes it possible to adjust the sentiment of a given term relative to its environment (usually on the level of the sentence). When a piece of unstructured text is analyzed using natural language processing, each concept in the specified environment is given a score based on the way sentiment words relate to the concept and its associated score. This allows movement to a more sophisticated understanding of sentiment, because it is now possible to adjust the sentiment value of a concept relative to modifications that may surround it. Words, for example, that intensify, relax or negate the sentiment expressed by the concept can affect its score. Alternatively, texts can be given a positive and negative sentiment strength score if the goal is to determine the sentiment in a text rather than the overall polarity and strength of the text. There are various other types of sentiment analysis, such as aspect-based sentiment analysis, grading sentiment analysis (positive, negative, neutral), multilingual sentiment analysis and detection of emotions. === Subjectivity/objectivity identification === This task is commonly defined as classifying a given text (usually a sentence) into one of two classes: objective or subjective. This problem can sometimes be more difficult than polarity classification. The subjectivity of words and phrases may depend on their context and an objective document may contain subjective sentences (e.g., a news article quoting people's opinions). Moreover, as mentioned by Su, results are largely dependent on the definition of subjectivity used when annotating texts. However, Pang showed that removing objective sentences from a document before classifying its polarity helped improve performance. Subjective and objective identification, emerging subtasks of sentiment analysis to use syntactic, semantic features, and machine learning knowledge to identify if a sentence or document contains facts or opinions. Awareness of recognizing factual and opinions is not recent, having possibly first presented by Carbonell at Yale University in 1979. The term objective refers to the incident carrying factual information. Example of an objective sentence: 'To be elected president of the United States, a candidate must be at least thirty-five years of age.' The term subjective describes the incident contains non-factual information in various forms, such as personal opinions, judgment, and predictions, also known as 'private states'. In the example down below, it reflects a private states 'We Americans'. Moreover, the target entity commented by the opinions can take several forms from tangible product to intangible topic matters stated in Liu (2010). Furthermore, three types of attitudes were observed by Liu (2010), 1) positive opinions, 2) neutral opinions, and 3) negative opinions. Example of a subjective sentence: 'We Americans need to elect a president who is mature and who is able to make wise decisions.' This analysis is a classification problem. Each class's collections of words or phrase indicators are defined for to locate desirable patterns on unannotated text. For subjective expression, a different word list has been created. Lists of subjective indicators in words or phrases have been developed by multiple researchers in the linguist and natural language processing field states in Riloff et al. (2003). A dictionary of extraction rules has to be created for measuring given expressions. Over the years, in subjective detection, the features extraction progression from curating features by hand to automated features learning. At the moment, automated learning methods can further separate into supervised and unsupervised machine learning. Patterns extraction with machine learning process annotated and unannotated text have been explored extensively by academic researchers. However, researchers recognized several challenges in developing fixed sets of rules for expressions respectably. Much of the challenges in rule development stems from the nature of textual information. Six challenges have been recognized by several researchers: 1) metaphorical expressions, 2) discrepancies in writings, 3) context-sensitive, 4) represented words with fewer usages, 5) time-sensitive, and 6) ever-growing volume. Metaphorical expressions. The text contains metaphoric expression may impact on the performance on the extraction. Besides, metaphors take in different forms, which may have been contribu

    Read more →
  • VieON

    VieON

    VieON is an mobile application for television and video on demand provided by VieON Joint Stock Company (formerly Dzones), a subsidiary of DatVietVAC Media and Entertainment Group in Vietnam. The app was launched in 2020, featuring over 140 domestic and international television channels, original series, popular entertainment programs known nationwide, top-tier sports events and live streaming of major events. Additionally, VieON provides animated films, television series and television programs from various countries such as South Korea and China. == History == The application was planned for development in 2016, with the cooperation of strategic consulting partner BCG Digital Ventures from the United States. Prior to 2020, VieON was a rebranded version of VTVcab ON, a product managed by Vietnam Cable Television Corporation (VTVCab) and DatVietVAC. On June 15, 2020, after four years of research and testing, the new version of VieON was officially released by DatVietVAC Group, with Vie Channel Joint Stock Company as the business entity and service provider. This is considered the official launch date of the application. On July 21, 2023, VieON transitioned its business operations and service provision to VieON Joint Stock Company. In January 2024, VieON officially launched its global version, VieON Global, targeting Vietnamese users living abroad. == Background == According to Kantar Media Vietnam, up to 84% of Vietnamese people aged 15–54 use social media daily, and in a similar survey by Nielsen, 90% of respondents said they watch live TV weekly. Additionally, according to research organization Muvi, Southeast Asia's OTT market revenue could reach $650 million annually starting next year. Understanding this, DatVietVAC Group has planned to research and develop an OTT application, even though the Vietnamese market already has some major players such as FPT Play and the international giant Netflix. Additionally, DatVietVAC does not hide its ambition to make this application the number one entertainment channel for Vietnamese people.

    Read more →
  • Intranet

    Intranet

    An intranet is a computer network for sharing information, easier communication, collaboration tools, operational systems, and other computing services within an organization, usually to the exclusion of access by outsiders. The term is used in contrast to public networks, such as the Internet, but uses the same technology based on the Internet protocol suite. An organization-wide intranet can constitute a focal point of internal communication and collaboration, and provide a single starting point to access internal and external resources. In its simplest form, an intranet is established with the technologies for local area networks (LANs) and wide area networks (WANs). Many modern intranets have search engines, user profiles, blogs, mobile apps with notifications, and events planning within their infrastructure. An intranet is sometimes contrasted to an extranet. While an intranet is generally restricted to employees of the organization, extranets may also be accessed by customers, suppliers, or other approved parties. Extranets extend a private network onto the Internet with special provisions for authentication, authorization and accounting (AAA protocol). == Uses == Intranets are increasingly being used to deliver tools, such as for collaboration (to facilitate working in groups and teleconferencing) or corporate directories, sales and customer relationship management, or project management. Intranets are also used as corporate culture-change platforms. For example, a large number of employees using an intranet forum application to host a discussion about key issues could come up with new ideas related to management, productivity, quality, and other corporate issues. In large intranets, website traffic is often similar to public website traffic and can be better understood by using web metrics software to track overall activity. User surveys also improve intranet website effectiveness. Larger businesses allow users within their intranet to access public internet through firewall servers. They have the ability to screen incoming and outgoing messages, keeping security intact. When part of an intranet is made accessible to customers and others outside the business, it becomes part of an extranet. Businesses can send private messages through the public network using special encryption/decryption and other security safeguards to connect one part of their intranet to another. Intranet user-experience, editorial, and technology teams work together to produce in-house sites. Most commonly, intranets are managed by the communications, HR or CIO departments of large organizations, or some combination of these. Because of the scope and variety of content and the number of system interfaces, the intranets of many organizations are much more complex than their respective public websites. Intranets and the use of intranets are growing rapidly. According to the Intranet Design Annual 2007 from Nielsen Norman Group, the number of pages on participants' intranets averaged 200,000 over the years 2001 to 2003 and has grown to an average of 6 million pages over 2005–2007. == Benefits == Intranets can help users locate and view information faster and use applications relevant to their roles and responsibilities. With a web browser interface, users can access data held in any database the organization wants to make available at any time and — subject to security provisions — from anywhere within company workstations, increasing employees' ability to perform their jobs faster, more accurately, and with confidence that they have the right information. It also helps improve services provided to users. Using hypermedia and Web technology, Web publishing allows for the maintenance of and easy access to cumbersome corporate knowledge, such as employee manuals, benefits documents, company policies, business standards, news feeds, and even training, all of which can be accessed throughout a company using common Internet standards (Acrobat files, Flash files, CGI applications). Because each business unit can update the online copy of a document, the most recent version is usually available to employees using the intranet. Intranets are also used as a platform for developing and deploying applications to support business operations and decisions across the internetworked enterprise. Information is easily accessible to all authorised users, enabling collaboration. Being able to communicate in real-time through integrated third-party tools, such as an instant messenger, promotes the sharing of ideas and removes blockages to communication to help boost a business's productivity. Intranets can serve as powerful tools for communicating (such as through chat, email and/or blogs) within a given organization about vertically strategic initiatives that have a global reach throughout said organization. The type of information that can easily be conveyed is the purpose of the initiative and what it is aiming to achieve, who is driving it, results achieved to date, and whom to speak to for more information. By providing this information on the intranet, staff can keep up-to-date with the strategic focus of their organization. For example, when Nestlé had a number of food processing plants in Scandinavia, their central support system had to deal with a number of queries every day. When Nestlé decided to invest in an intranet, they quickly realized the savings. Gerry McGovern says that the savings from the reduction in query calls was substantially greater than the investment in the intranet. Users can view information and data via a web browser rather than maintaining physical documents such as procedure manuals, internal phone list and requisition forms. This can potentially save the business money on printing, duplicating documents, and the environment, as well as document maintenance overhead. For example, the HRM company PeopleSoft "derived significant cost savings by shifting HR processes to the intranet". McGovern goes on to say the manual cost of enrolling in benefits was found to be US$109.48 per enrollment. "Shifting this process to the intranet reduced the cost per enrollment to $21.79; a saving of 80 percent". Another company that saved money on expense reports was Cisco. "In 1996, Cisco processed 54,000 reports and the amount of dollars processed was USD19 million". Many companies dictate computer specifications which, in turn, may allow Intranet developers to write applications that only have to work on one browser such that there are no cross-browser compatibility issues. Being able to specifically address one's "viewer" is a great advantage. Since intranets are user-specific (requiring database/network authentication prior to access), users know exactly who they are interfacing with and can personalize their intranet based on role (job title, department) or individual ("Congratulations Jane, on your 3rd year with our company!"). Since "involvement in decision making" is one of the main drivers of employee engagement, offering tools (like forums or surveys) that foster peer-to-peer collaboration and employee participation can make employees feel more valued and involved. == Planning and creation == Most organizations devote considerable resources into the planning and implementation of their intranet as it is of strategic importance to the organization's success. Some of the planning would include topics such as determining the purpose and goals of the intranet, identifying persons or departments responsible for implementation and management and devising functional plans, page layouts and designs. The appropriate staff would also ensure that implementation schedules and phase-out of existing systems were organized, while defining and implementing security of the intranet and ensuring it lies within legal boundaries and other constraints. In order to produce a high-value end product, systems planners should determine the level of interactivity (e.g. wikis, on-line forms) desired. Planners may also consider whether the input of new data and updating of existing data is to be centrally controlled or devolve. These decisions sit alongside to the hardware and software considerations (like content management systems), participation issues (like good taste, harassment, confidentiality), and features to be supported. Intranets are often static sites; they are a shared drive, serving up centrally stored documents alongside internal articles or communications (often one-way communication). By leveraging firms which specialise in 'social' intranets, organisations are beginning to think of how their intranets can become a 'communication hub' for their entire team. The actual implementation would include steps such as securing senior management support and funding, conducting a business requirement analysis and identifying users' information needs. From the technical perspective, there would need to be a coordinated installation of the web server and user access netw

    Read more →
  • Online Safety Amendment (Social Media Minimum Age) Act 2024

    Online Safety Amendment (Social Media Minimum Age) Act 2024

    The Online Safety Amendment (Social Media Minimum Age) Act 2024 is an Australian act of parliament that prohibits minors under the age of 16 from holding an account on certain social media platforms. It is an amendment to the Online Safety Act 2021 and was passed by the Parliament of Australia on 29 November 2024. It imposes monetary penalties on social media companies that fail to take reasonable steps to prevent minors under 16 that are located in Australia from having accounts on their services. The legislation allows the government to determine which social media platforms must ban age‑restricted users and proclaim a date for the commencement of the ban, with those provisions taking effect on 10 December 2025. Facebook, Instagram, Reddit, Snapchat, TikTok, Twitter, Threads, Twitch, Kick, and YouTube were age‑restricted on 10 December 2025, with the possibility that more platforms may be added. The act is being challenged in the High Court by the Digital Freedom Project. == Background == The ban on access to social media by young people by the federal government originated in November 2023, when shadow communications minister David Coleman introduced a private member's bill requiring the government to conduct a trial for age-verification technology on pornography and social media platforms. While the bill did not succeed, the Albanese government funded the trial in the 2024 Australian federal budget. In June 2024, opposition leader Peter Dutton pledged that a Coalition government would implement a ban on social media for under-16s within 100 days of taking office. The following month, prime minister Anthony Albanese announced the government would introduce legislation banning under-16s from social media. The Online Safety Amendment (Social Media Minimum Age) Bill 2024 was introduced into parliament by minister for communications Michelle Rowland on 21 November 2024, passing both houses on 28 November 2024. The ban on access to social media by young people by the federal government also gained momentum following an entreaty by the wife of the premier of South Australia, Peter Malinauskas, to her husband. She requested that he read The Anxious Generation by Jonathan Haidt and take action to address the impact of social media on the mental health of children. The couple have four young children, and, thinking of them, the premier thought that government should play a part in helping parents to regulate use of social media by their children at home. Malinauskas contacted former High Court chief justice Robert French, who agreed to look at the issue, and in September 2024 handed the premier a 267 page proposal, which he dubbed a "Swiss Army knife" rather than a machete, to adjust to social media's "changing landscape and its complexity". The leaders of other states and territories gave their support to Malinauskas's idea, and he took the French report to National Cabinet to collaborate with chief ministers, premiers, and the prime minister. Community support swelled after stories of parents who had lost their children to suicide after being bullied on social media were published. Albanese himself was moved by a personal letter received from Kelly O'Brien, whose 12-year-old daughter Charlotte had taken her own life due to bullying at school. An event took place at the sidelines of the United Nations General Assembly session in September 2025 at which a mother spoke of her daughter's suicide as "death by bullying ... enabled by social media". The speech won support from world leaders in Greece, Fiji, Tonga and the president of the European Commission Ursula von der Leyen. In early September 2024, South Australia proposed legislation similar to the federal law now in place. The state-based version was intended to ban users under the age of 14, unlike the federal law, which bans those under 16. The state-based law also proposed to require parental consent for 14 and 15‑year‑olds. Later in September, prime minister Anthony Albanese announced that his government intended to introduce legislation to set a minimum age requirement for social media. In November 2024, the federal government indicated their intention to engage the Age Check Certification Scheme following a tender process for an age assurance technology trial. The Albanese government's proposed ban was supported by the governments of every state and territory. Albanese described social media as a "scourge", and said "I want people to spend more time on the footy field or the netball court than they're spending on their phones", that family members are "worried sick about the safety of our kids online", and that social media "is having a negative impact on young people's mental health and on anxiety". Albanese's statements followed an earlier pledge by Liberal opposition leader Peter Dutton who was pushed by the early advocacy of shadow communications minister David Coleman to implement a ban on social media for under 16s within 100 days of being elected. The opposition organised an open letter signed by 140 experts who specialise in child welfare and technology. The opposition was concerned about the invasion of privacy that will occur with the introduction of identification-based age checks. An advocacy group for digital companies in Australia called the plans a "20th Century response to 21st Century challenges". A director of a mental health service voiced concerns, stating that "73% of young people across Australia who accessed mental health support did so through social media". == Implementation == Social media companies will receive a transition period of one year after the legislation is enacted to introduce reasonable controls preventing minors under the age of 16 from holding accounts on their services while physically located in Australia. Enforcement will involve fines of up to A$49.5 million for companies failing to take such steps, with no consequences for parents and children who violate the restrictions. There are no parental consent exceptions to the ban, and while the use of virtual private networks (VPNs) to access these services remains legal in Australia, the services are expected to try to stop under 16s from using VPNs to pretend to be outside Australia. The expectation is to make best-efforts to implement the ban on platforms including Facebook, Instagram, Reddit, Snapchat, TikTok, Twitter, Threads, Twitch, Kick and YouTube. Some social media companies are now obligated to become good enough at profiling Australian children under 16 to satisfy the Australian government they tried to implement the ban to avoid being fined. Consequently, social media companies said they will try to identify restricted users using various methods including behavioural inferencing. On 5 November 2025, it was announced that online gaming platform Roblox will not be banned, but Reddit and live-streaming platform Kick will be added to the list of platforms to be banned. A report by Age Check Certification Scheme, a UK company recruited by the government to consult on the technology used to implement the restrictions, was issued in June 2025, ahead of the December deadline to implement the ban. In June 2025, the preliminary report was released, which stated that "there are no significant technological barriers" to implementing the ban. In late July 2025, Google warned that it would sue the Australian government if YouTube was included in the ban. On 30 July, the government announced that it would extend its social media age limit to include YouTube, following advice from Grant. On 30 July 2025, the minister for communications, Anika Wells, published the Online Safety (Age-Restricted Social Media Platforms) Rules 2025, which specify exactly which types of social media platforms will be banned for certain users. On 31 August 2025, the full report was released, which stated that it would technically be possible to implement the ban; however, coordination among different services is required to successfully implement it. It also highlighted the benefits and flaws of different methods of age verification. On 16 September 2025, it was announced that the eSafety Commissioner will be able to take legal action against social media companies that have not pursued reasonable steps to bar users under the age of 16, and that fines can range up to A$49.5 million against these companies in court. On 19 November 2025, Meta announced that from 4 December their platforms (Instagram, Facebook, and Threads) would be removing users under the age of 16 ahead of the 10 December deadline. Users will be able to scan a face or provide an identity document to prove their age. On 21 November 2025, the eSafety Commissioner announced that the live-streaming platform Twitch will be included in the ban, but that Pinterest would not be. In December 2025, eSafety Commissioner Julie Inman Grant suggested efforts to block users include use by social media companies of various "signals" to identify children that are

    Read more →