Data exploration

Data exploration

Data exploration is an approach similar to initial data analysis, whereby a data analyst uses visual exploration to understand what is in a dataset and the characteristics of the data, rather than through traditional data management systems. These characteristics can include size or amount of data, completeness of the data, correctness of the data, possible relationships amongst data elements or files/tables in the data. Data exploration is typically conducted using a combination of automated and manual activities. Automated activities can include data profiling or data visualization or tabular reports to give the analyst an initial view into the data and an understanding of key characteristics. This is often followed by manual drill-down or filtering of the data to identify anomalies or patterns identified through the automated actions. Data exploration can also require manual scripting and queries into the data (e.g. using languages such as SQL or R) or using spreadsheets or similar tools to view the raw data. All of these activities are aimed at creating a mental model and understanding of the data in the mind of the analyst, and defining basic metadata (statistics, structure, relationships) for the data set that can be used in further analysis. Once this initial understanding of the data is had, the data can be pruned or refined by removing unusable parts of the data (data cleansing), correcting poorly formatted elements and defining relevant relationships across datasets. This process is also known as determining data quality. Data exploration can also refer to the ad hoc querying or visualization of data to identify potential relationships or insights that may be hidden in the data and does not require to formulate assumptions beforehand. Traditionally, this had been a key area of focus for statisticians, with John Tukey being a key evangelist in the field. Today, data exploration is more widespread and is the focus of data analysts and data scientists; the latter being a relatively new role within enterprises and larger organizations. == Interactive Data Exploration == This area of data exploration has become an area of interest in the field of machine learning. This is a relatively new field and is still evolving. As its most basic level, a machine-learning algorithm can be fed a data set and can be used to identify whether a hypothesis is true based on the dataset. Common machine learning algorithms can focus on identifying specific patterns in the data. Many common patterns include regression and classification or clustering, but there are many possible patterns and algorithms that can be applied to data via machine learning. By employing machine learning, it is possible to find patterns or relationships in the data that would be difficult or impossible to find via manual inspection, trial and error or traditional exploration techniques. == Software == Trifacta – a data preparation and analysis platform Paxata – self-service data preparation software Alteryx – data blending and advanced data analytics software Microsoft Power BI - interactive visualization and data analysis tool OpenRefine - a standalone open source desktop application for data clean-up and data transformation Tableau software – interactive data visualization software

Lexical choice

Lexical choice is the subtask of Natural language generation that involves choosing the content words (nouns, non-auxiliary verbs, adjectives, and adverbs) in a generated text. Function words (determiners, for example) are usually chosen during realisation. == Examples == The simplest type of lexical choice involves mapping a domain concept (perhaps represented in an ontology) to a word. For example, the concept Finger might be mapped to the word finger. A more complex situation is when a domain concept is expressed using different words in different situations. For example, the domain concept Value-Change can be expressed in many ways: The temperature rose: the verb rose is used for a Value-Change in temperature which increases the value. The temperature fell: the verb fell is used for a Value-Change in temperature which decreases the value. The rain got heavier: the phrase got heavier is used for a Value-Change in precipitation amount when the precipitation is rain. Sometimes words can communicate additional contextual information, for example: The temperature plummeted: the verb plummeted is used for a Value-Change in temperature which decreases the value, when the change is rapid and large. Contextual information is especially significant for vague terms such as tall. For example, a 2m tall man is tall, but a 2m tall horse is small. == Linguistic perspective == Lexical choice modules must be informed by linguistic knowledge of how the system's input data maps onto words. This is a question of semantics, but it is also influenced by syntactic factors (such as collocation effects) and pragmatic factors (such as context). Hence NLG systems need linguistic models of how meaning is mapped to words in the target domain (genre) of the NLG system. Genre tends to be very important; for example the verb veer has a very specific meaning in weather forecasts (wind direction is changing in a clockwise direction) which it does not have in general English, and a weather-forecast generator must be aware of this genre-specific meaning. In some cases there are major differences in how different people use the same word; for example, some people use by evening to mean 6PM and others use it to mean midnight. Psycholinguists have shown that when people speak to each other, they agree on a common interpretation via lexical alignment; this is not something which NLG systems can yet do. Ultimately, lexical choice must deal with the fundamental issue of how language relates to the non-linguistic world. For example, a system which chose colour terms such as red to describe objects in a digital image would need to know which RGB pixel values could generally be described as red; how this was influenced by visual (lighting, other objects in the scene) and linguistic (other objects being discussed) context; what pragmatic connotations were associated with red (for example, when an apple is called red, it is assumed to be ripe as well as have the colour red); and so forth. == Algorithms and models == A number of algorithms and models have been developed for lexical choice in the research community, for example Edmonds developed a model for choosing between near-synonyms (words with similar core meanings but different connotations). However such algorithms and models have not been widely used in applied NLG systems; such systems have instead often used quite simple computational models, and invested development effort in linguistic analysis instead of algorithm development.

Fingerprint scanner

Fingerprint scanners are a type of biometric security device that identify an individual by identifying the structure of their fingerprints. They are used in police stations, security industries, smartphones, and other mobile devices. == Fingerprints == People have patterns of friction ridges on their fingers, these patterns are called the fingerprints. Fingerprints are uniquely detailed, durable over an individual's lifetime, and difficult to alter. Due to the unique combinations, fingerprints have become an ideal means of identification. == Types of fingerprint scanners == There are four types of fingerprint scanners: Optical scanners take a visual image of the fingerprint using a digital camera. Capacitive or CMOS scanners use capacitors and thus electric current to form an image of the fingerprint. This type of scanner tends to excel in terms of precision. Ultrasonic fingerprint scanners use high frequency sound waves to penetrate the epidermal (outer) layer of the skin. Thermal scanners sense the temperature differences on the contact surface, in between fingerprint ridges and valleys. All fingerprint scanners are susceptible to spoofing through fingerprints replicated using photographs and 3D printing. == Construction forms == Each type of fingerprint sensor can take two basic forms: the stagnant and the moving fingerprint scanner. Stagnant: The scanning module is mounted statically, and the user is required to swipe their fingers across it. This is cheaper but also less reliable than the moving form. Imaging can be less than ideal if the finger is not dragged over the scanning area at constant speed. Moving: The scanning module is mounted on a movable surface, while the user's finger can remain static. Because this layout allows the scanning module to pass the fingerprint at a constant speed, this method is generally more reliable. == Form factors == === Peripherals === Add-on fingerprint readers for PCs initially appeared in the late 1990's in the form of PCMCIA modules. Microsoft released a model in its IntelliMouse line with an integrated fingerprint reader in 2005. === Integrated readers === Laptops with built-in readers emerged around the same time as peripheral readers with devices such as NECs MC/R730F. IBM produced laptops with integrated readers starting in 2004. Apple introduced fingerprint scanners to their devices under the name Touch ID in 2013. These were initially released on the iPhone 5S, with the technology remaining exclusive to iPhones until the release of the 2016 MacBook Pro. On both laptops and smartphones, the fingerprint sensor usually uses a USB or I2C interface internally.

Raseef22

Raseef22 (Arabic: رصيف22) is a liberal Arabic media network founded in 2013 based in Beirut, Lebanon. It publishes content in Arabic and English from different Arab states and describes itself as an independent media platform. International Media Support mentions Raseef22 along with HuffPost Arabic and Al Jazeera as one of the biggest Pan-Arab online platforms. == Name == The Arabic word raseef (رَصِيف) means platform or pavement, and the number 22 refers to the number of states in the Arab League. == History == Kareem Sakka co-founded Raseef22 in the aftermath of the Arab Spring, which he cites as a source of inspiration. In an article in The Washington Post, he wrote that Raseef22 was created as a "digital space for those eager to know what was going on around them." Raseef22 was one of the 500 websites censored in Egypt in late 2017 after it published an article on Egyptian security agencies' vies to influence the media. After the site was blocked in Egypt, it was targeted in a cyber attack that took it offline in locations around the world. Jamal Khashoggi wrote for Raseef22 regularly. One of his notable articles was "Notes on the Freedom of the Arabs from Oslo, Norway," published June 5, 2018. The site was blocked in Saudi Arabia December 2018 when the Saudi Ministry of Communications and Information Technology ordered its censorship due to its "unprecedented response to the assassination of Jamal Khashoggi in Istanbul." This decision might have also been related to Raseef22's coverage of Saudi-Israeli relations and interviews with activists later imprisoned or placed under house arrest coverage In 2019 the Association of LGBT Journalists (AJL) in Paris gave Raseef22 a golden foreign press award for its six-month series of articles on gender and sexuality issues. == Readership == According to its publisher in 2019, the news agency counted 12 million readers annually from 22 Arab nations. Of the readership, he wrote that it "believes in the talent and promise of the Arab mind and sees the ugliness of tyranny, patriarchy, misogyny and the futility of proxy rulers and wars." Al-Quds Al-Arabi described Raseef22 as "oriented to the youth."

Digital edition

A digital edition is an online magazine or online newspaper delivered in electronic form which is formatted identically to the print version. Digital editions are often called digital facsimiles to underline the likeness to the print version. Digital editions have the benefit of reduced cost to the publisher and reader by avoiding the time and the expense to print and deliver paper edition. This format is considered more environmentally friendly due to the reduction of paper and energy use. These editions also often feature interactive elements such as hyperlinks both within the publication itself and to other internet resources, search option and bookmarking, and can also incorporate multimedia such as video or animation to enhance articles themselves or for advertisement purposes. Some delivery methods also include animation and sound effects that replicate turning of the page to further enhance the experience of their print counterparts. Magazine publishers have traditionally relied on two revenue sources: selling ads and selling magazines. Additionally some publishers are using other electronic publication methods such as RSS to reach out to readers and inform them when new digital editions are available. Current technologies are generally either reader-based, requiring a download of an application and subsequent download of each edition, or browser-based, often using Macromedia Flash, requiring no application download (such as Adobe Acrobat). Some application-based readers allow users to access editions while not connected to internet. Dedicated hardware such as the Amazon Kindle and the iPad is also available for reading digital editions of select books, popular national magazines such as Time, The Atlantic, and Forbes and popular national newspapers such as the New York Times, Wall Street Journal, and Washington Post. Archives of print newspapers, in some cases dating hundreds of years back, are being digitized and made available online. Google is indexing existing digital archives produced by the newspapers themselves or by third parties. Newspaper and magazine archival began with microform film formats solving the problem of efficiently storing and preserving. This format, however, lacked accessibility. Many libraries, especially state libraries in the United States are archiving their collections digitally and converting existing microfilm to digital format. The Library of Congress provides project planning assistance and the National Endowment for the Humanities procures funding through grants from its National Digital Newspaper Program. Digital magazines, ezines, e-editions and emags are sometimes referred to as digital editions, however some of these formats are published only in digital format unlike digital editions which replicate a printed edition as well. == Digital magazines == Digital-replica magazines number in thousands—consumer and business publications, house magazines for associations, institutions and corporations – and conversion from print to digital was still increasing as of 2009. A 2008 report funded by digital-replica technology providers and auditing agencies counted 1,786 digital-replica editions having more than 7 million circulation among business-to-business publications, of which 230 editions were audited The same report counted 1,470 digital-replica editions of consumer magazines having 5.5 million digital circulation, of which 240 editions were audited. These authors estimated that by year end of 2009 there would be 8,000 digital magazines, having a combined distribution of more than 30 million people. Surveys have shown that, while not all subscribers prefer a digital edition, some do because of the environmental benefit and also because digital magazines are searchable and may easily be passed along or linked to. One such survey funded by a digital publisher reported on inputs from more than 30,000 subscribers to business, consumer and other digital magazines. == Digital magazine business models == === Reduced printing and distribution costs === The publishers' choice to save by moving some or all subscribers from print to digital is widely accepted. Oracle magazine, which has 176,000 of its 516,000 subscribers receiving digital according to its June 2009 BPA circulation statement, is said to be the most widely circulated digital edition of a business-to-business publication. Publishers who do this need to choose whether to make some issues all-digital, move some subscribers to digital edition, add some digital-only subscribers, or send all subscribers the digital edition. === Paid subscription revenue === In 2009, a major consumer magazine, PC Magazine, went all-digital, charging an annual subscription fee for its digital-replica edition. Many consumer magazines and newspapers are already available in eReader formats that are sold through booksellers. === Sponsorship and advertising revenue === Digital editions often carry special "front cover" advertising, or advertising on the email message alerting the subscriber of the digital edition. Publishers also produce special digital-only inserts and rich-media ads or advertorials. === Designed-for-digital issues === Another approach is to fully replace printed issues with digital ones, or to use digital editions for extra issues that would otherwise have to be printed.

Grokking (machine learning)

In machine learning, grokking, or delayed generalization, is a phenomenon observed in some settings where a model abruptly transitions from overfitting (performing well only on training data) to generalizing (performing well on both training and test data), after many training iterations with little or no improvement on the held-out data. This contrasts with what is typically observed in machine learning, where generalization occurs gradually alongside improved performance on training data. == Origin == Grokking was introduced by OpenAI researcher Alethea Power and colleagues in the January 2022 paper "Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets". It is derived from the word grok coined by Robert Heinlein in his novel Stranger in a Strange Land. In ML research, "grokking" is not used as a synonym for "generalization"; rather, it names a sometimes-observed delayed‑generalization training phenomenon in which training and held‑out performance do not improve in tandem, and in which held‑out performance rises abruptly later. Authors also analyze the "grokking time", the epoch or step at which this transition occurs in those scenarios. == Interpretations == Grokking can be understood as a phase transition during the training process. In particular, recent work has shown that grokking may be due to a complexity phase transition in the model during training. While grokking has been thought of as largely a phenomenon of relatively shallow models, grokking has been observed in deep neural networks and non-neural models and is the subject of active research. One potential explanation is that the weight decay (a component of the loss function that penalizes higher values of the neural network parameters, also called regularization) slightly favors the general solution that involves lower weight values, but that is also harder to find. According to Neel Nanda, the process of learning the general solution may be gradual, even though the transition to the general solution occurs more suddenly later. Recent theories have hypothesized that grokking occurs when neural networks transition from a "lazy training" regime where the weights do not deviate far from initialization, to a "rich" regime where weights abruptly begin to move in task-relevant directions. Follow-up empirical and theoretical work has accumulated evidence in support of this perspective, and it offers a unifying view of earlier work as the transition from lazy to rich training dynamics is known to arise from properties of adaptive optimizers, weight decay, initial parameter weight norm, and more. This perspective is complementary to a unifying "pattern learning speeds" framework that links grokking and double descent; within this view, delayed generalization can arise across training time ("epoch‑wise") or across model size ("model‑wise"), and the authors report "model‑wise grokking".

Data communication

Data communication is the transfer of data over a point-to-point or point-to-multipoint communication channel. Data communication comprises data transmission and data reception and can be classified as analog transmission and digital communications. Analog data communication conveys voice, data, image, signal or video information using a continuous signal, which varies in amplitude, phase, or some other property. In baseband analog transmission, messages are represented by a sequence of pulses by means of a line code; in passband analog transmission, they are communicated by a limited set of continuously varying waveforms, using a digital modulation method. Passband modulation and demodulation are carried out by modem equipment. Digital transmission and digital reception are the transfer of either a digitized analog signal or a born-digital bitstream. Baseband digital transmission is regarded as comprising part of a digital signal, whereas passband transmission of digital data may also or alternatively be considered a form of digital-to-analog conversion. Data communication channels include copper wires, optical fibers, wireless communication using radio spectrum, storage media and computer buses. The data are represented as an electromagnetic signal, such as an electrical voltage, radiowave, microwave, or infrared signal. == Distinction between related subjects == Digital transmission or data transmission traditionally belongs to telecommunications and electrical engineering. Basic principles of data transmission may also be covered within the computer science or computer engineering topic of data communications, which also includes computer networking applications and communication protocols, for example, routing, switching and inter-process communication. Although the Transmission Control Protocol (TCP) involves transmission, TCP and other transport layer protocols are covered in computer networking but not discussed in a textbook or course about data transmission. In most textbooks, the term analog transmission only refers to the transmission of an analog message signal (without digitization) by means of an analog signal, either as a non-modulated baseband signal or as a passband signal using an analog modulation method such as AM or FM. It may also include analog-over-analog pulse modulated baseband signals such as pulse-width modulation. In a few books within the computer networking tradition, analog transmission also refers to passband transmission of bit-streams using digital modulation methods such as FSK, PSK and ASK. The theoretical aspects of data transmission are covered by information theory and coding theory. == Protocol layers and sub-topics == Courses and textbooks in the field of data transmission typically deal with the following OSI model protocol layers and topics: Layer 1, the physical layer: Channel coding including Digital modulation schemes Line coding schemes Forward error correction (FEC) codes Bit synchronization Multiplexing Equalization Channel models Layer 2, the data link layer: Channel access schemes, media access control (MAC) Packet mode communication and Frame synchronization Error detection and automatic repeat request (ARQ) Flow control Layer 6, the presentation layer: Source coding (digitization and data compression), and information theory. Cryptography (may occur at any layer) It is also common to deal with the cross-layer design of those three layers. == Applications and history == Data (mainly but not exclusively informational) has been sent via non-electronic (e.g. optical, acoustic, mechanical) means since the advent of communication. Analog signal data has been sent electronically since the advent of the telephone. However, the first data electromagnetic transmission applications in modern time were electrical telegraphy (1809) and teletypewriters (1906), which are both digital signals. The fundamental theoretical work in data transmission and information theory by Harry Nyquist, Ralph Hartley, Claude Shannon and others during the early 20th century, was done with these applications in mind. In the early 1960s, Paul Baran invented distributed adaptive message block switching for digital communication of voice messages using switches that were low-cost electronics. Donald Davies invented and implemented modern data communication during 1965–7, including packet switching, high-speed routers, communication protocols, hierarchical computer networks and the essence of the end-to-end principle. Baran's work did not include routers with software switches and communication protocols, nor the idea that users, rather than the network itself, would provide the reliability. Both were seminal contributions that influenced the development of computer networks. Data transmission is utilized in computers in computer buses and for communication with peripheral equipment via parallel ports and serial ports such as RS-232 (1969), FireWire (1995) and USB (1996). The principles of data transmission are also utilized in storage media for error detection and correction since 1951. The first practical method to overcome the problem of receiving data accurately by the receiver using digital code was the Barker code invented by Ronald Hugh Barker in 1952 and published in 1953. Data transmission is utilized in computer networking equipment such as modems (1940), local area network (LAN) adapters (1964), repeaters, repeater hubs, microwave links, wireless network access points (1997), etc. In telephone networks, digital communication is utilized for transferring many phone calls over the same copper cable or fiber cable by means of pulse-code modulation (PCM) in combination with time-division multiplexing (TDM) (1962). Telephone exchanges have become digital and software controlled, facilitating many value-added services. For example, the first AXE telephone exchange was presented in 1976. Digital communication to the end user using Integrated Services Digital Network (ISDN) services became available in the late 1980s. Since the end of the 1990s, broadband access techniques such as ADSL, Cable modems, fiber-to-the-building (FTTB) and fiber-to-the-home (FTTH) have become widespread to small offices and homes. The current tendency is to replace traditional telecommunication services with packet mode communication such as IP telephony and IPTV. Transmitting analog signals digitally allows for greater signal processing capability. The ability to process a communications signal means that errors caused by random processes can be detected and corrected. Digital signals can also be sampled instead of continuously monitored. The multiplexing of multiple digital signals is much simpler compared to the multiplexing of analog signals. Because of all these advantages, because of the vast demand to transmit computer data and the ability of digital communications to do so and because recent advances in wideband communication channels and solid-state electronics have allowed engineers to realize these advantages fully, digital communications have grown quickly. The digital revolution has also resulted in many digital telecommunication applications where the principles of data transmission are applied. Examples include second-generation (1991) and later cellular telephony, video conferencing, digital TV (1998), digital radio (1999), and telemetry. Data transmission, digital transmission or digital communications is the transfer of data over a point-to-point or point-to-multipoint communication channel. Examples of such channels include copper wires, optical fibers, wireless communication channels, storage media and computer buses. The data are represented as an electromagnetic signal, such as an electrical voltage, radio wave, microwave, or infrared light. While analog transmission is the transfer of a continuously varying analog signal over an analog channel, digital communication is the transfer of discrete messages over a digital or an analog channel. The messages are either represented by a sequence of pulses by means of a line code (baseband transmission) or by a limited set of continuously varying waveforms (passband transmission), using a digital modulation method. The passband modulation and corresponding demodulation (also known as detection) are carried out by modem equipment. According to the most common definition of a digital signal, both baseband and passband signals representing bit-streams are considered as digital transmission, while an alternative definition only considers the baseband signal as digital, and passband transmission of digital data as a form of digital-to-analog conversion. Data transmitted may be digital messages originating from a data source, for example, a computer or a keyboard. It may also be an analog signal, such as a phone call or a video signal, digitized into a bit-stream, for example,e using pulse-code modulation (PCM) or more advanced source coding (analog-to-digital conversion and