AI Pair Programmers: Free vs Paid (2026)

Trying to pick the best AI pair programmer? An AI pair programmer is software that uses machine learning to help you get more done — it scales effortlessly from a single task to thousands. The best picks balance beginner-friendly simplicity with the depth power users need, and they ship updates often. Whether you are a beginner or a pro, the right AI pair programmer slots into your workflow and pays for itself fast. This guide breaks down the top picks, their pros and cons, and who each one is best for.

Rclone

Rclone is an open source, multi threaded, command line computer program to manage or migrate content on cloud and other high latency storage. Its capabilities include sync, transfer, crypt, cache, union, compress and mount. The rclone website lists supported backends including S3 and Google Drive. Descriptions of rclone often carry the strapline "Rclone syncs your files to cloud storage". Those prior to 2020 include the alternative "Rsync for Cloud Storage". Rclone is well known for its rclone sync and rclone mount commands. It provides further management functions analogous to those ordinarily used for files on local disks, but which tolerate some intermittent and unreliable service. Rclone is commonly used with media servers such as Plex, Emby or Jellyfin to stream content direct from consumer file storage services. Official Ubuntu, Debian, Fedora, Gentoo, Arch, Brew, Chocolatey, and other package managers include rclone. == History == Nick Craig-Wood was inspired by rsync. Concerns about the noise and power costs arising from home computer servers prompted him to embrace cloud storage and he began developing rclone as open source software in 2012 under the name swiftsync. Rclone was promoted to stable version 1.00 in July 2014. In May 2017, Amazon Drive barred new users of rclone and other upload utilities, citing security concerns. Amazon Drive had been advertised as offering unlimited storage for £55 per year. Amazon's AWS S3 service continues to support new rclone users. The original rclone logo was updated in September 2018. In March 2020, Nick Craig-Wood resigned from Memset Ltd, a cloud hosting company he founded, to focus on open source software. Amazon's AWS April 2020 public sector blog explained how the Fred Hutch Cancer Research Center were using rclone in their Motuz tool to migrate very large biomedical research datasets in and out of AWS S3 object stores. In November 2020, rclone was updated to correct a weakness in the way it generated passwords. 
Passwords for encrypted remotes can be generated randomly by rclone or supplied by the user. In all versions of rclone from 1.49.0 to 1.53.2 the seed value for generated passwords was based on the number of seconds elapsed in the day, and therefore not truly random. CVE-2020-28924 recommended users upgrade to the latest version of rclone and check the passwords protecting their encrypted remotes. Release 1.55 of rclone in March 2021 included features sponsored by CERN and their CS3MESH4EOSC project. The work was EU funded to promote vendor-neutral application programming interfaces and protocols for synchronisation and sharing of academic data on cloud storage. == Backends and commands == Rclone supports the following services as backends. There are others, built on standard protocols such as WebDAV or S3, that work. WebDAV backends do not support rclone functionality dependent on server side checksum or modtime. Remotes are usually defined interactively from these backends, local disk, or memory (as S3), with rclone config. Rclone can further wrap those remotes with one or more of alias, chunk, compress, crypt or union, remotes. Once defined, the remotes are referenced by other rclone commands interchangeably with the local drive. Remote names are followed by a colon to distinguish them from local drives. For example, a remote example_remote containing a folder, or pseudofolder, myfolder is referred to within a command as a path example_remote:/myfolder. Rclone commands directly apply to remotes, or mount them for file access or streaming. With appropriate cache options the mount can be addressed as if a conventional, block level disk. Commands are provided to serve remotes over SFTP, HTTP, WebDAV, FTP and DLNA. Commands can have sub-commands and flags. Filters determine which files on a remote that rclone commands are applied to. rclone rc passes commands or new parameters to existing rclone sessions and has an experimental web browser interface. 
=== Crypt remotes === Rclone's crypt implements encryption of files at rest in cloud storage. It layers an encrypted remote over a pre-existing, cloud or other remote. Crypt is commonly used to encrypt / decrypt media, for streaming, on consumer storage services such as Google Drive. Rclone's configuration file contains the crypt password. The password can be lightly obfuscated, or the whole rclone.conf file can be encrypted. Crypt can either encrypt file content and name, or additionally full paths. In the latter case there is a potential clash with encryption for cloud backends, such as Microsoft OneDrive, having limited path lengths. Crypt remotes do not encrypt object modification time or size. The encryption mechanism for content, name and path is available, for scrutiny, on the rclone website. Key derivation is with scrypt. === Example syntax (Linux) === These examples describe paths and file names but object keys behave similarly. To recursively copy files from directory remote_stuff, at the remote xmpl, to directory stuff in the home folder:- -v enables logging and -P, progress information. By default rclone checks the file integrity (hash) after copy; can retry each file up to three times if the operation is interrupted; uses up to four parallel transfer threads, and does not apply bandwidth throttling. Running the above command again copies any new or changed files at the remote to the local folder but, like default rsync behaviour, will not delete from the local directory, files which have been removed from the remote. 
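The command line for the copy described above is missing from this text. As a minimal sketch, assuming the usual form rclone copy <source> <destination> with the -v and -P flags mentioned, it could be assembled from Python as follows. The helper name build_copy_command is hypothetical, and the snippet only builds and prints the command rather than running it.

```python
import os
import shlex


def build_copy_command(remote, remote_dir, local_dir):
    """Build (but do not run) an rclone copy command as an argv list.

    Hypothetical helper for illustration; the command layout is an
    assumption reconstructed from the description in the text.
    """
    source = f"{remote}:{remote_dir}"        # remote name, a colon, then the path
    target = os.path.expanduser(local_dir)   # expand "~" here because no shell is involved
    # -v enables logging and -P shows progress information.
    return ["rclone", "copy", "-v", "-P", source, target]


argv = build_copy_command("xmpl", "remote_stuff", "~/stuff")
print(shlex.join(argv))

# To actually run it (requires rclone and a configured remote named xmpl):
# import subprocess
# subprocess.run(argv, check=True)
```

Keeping the command as a list rather than a single string avoids shell quoting problems if a folder name contains spaces.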
The rclone sync command additionally deletes files from the local folder which have been removed from the remote, more like the behaviour of rsync with a --delete flag. The rclone move command deletes files from the source after they have been transferred to the local directory, more like the behaviour of rsync with a --remove-source-files flag. The rclone mount command mounts the remote directory at a mountpoint, here the pre-existing, empty stuff directory in the home directory; an ampersand at the end of the command makes the mount run as a background process. Default rclone syntax can be modified. Alternative transfer, filter, conflict and backend specific flags are available. Performance choices include number of concurrent transfer threads; chunk size; bandwidth limit profiling, and cache aggression. == Academic evaluation == In 2018, University of Kentucky researchers published a conference paper comparing use of rclone and other command line, cloud data transfer agents for big data. The paper was published as a result of funding by the National Science Foundation. Later that year, University of Utah's Center for High Performance Computing examined the impact of rclone options on data transfer rates. == Rclone use at HPC research sites == Examples are University of Maryland, Iowa State University, Trinity College Dublin, NYU, BYU, Indiana University, CSC Finland, Utrecht University, University of Nebraska, University of Utah, North Carolina State University, Stony Brook, Tulane University, Washington State University, Georgia Tech, National Institutes of Health, Wharton, Yale, Harvard, Minnesota, Michigan State, Case Western Reserve University, University of South Dakota, Northern Arizona University, University of Pennsylvania, Stanford, University of Southern California, UC Santa Barbara, UC Irvine, UC Berkeley, and SURFnet. == Rclone and cybercrime == May 2020 reports stated rclone had been used by hackers to exploit Diebold Nixdorf ATMs with ProLock ransomware. 
The FBI issued a Flash Alert MI-000125-MW on May 4, 2020, in relation to the compromise. They issued a further, related alert 20200901–001 in September 2020. Attackers had exfiltrated / encrypted data from organisations involved in healthcare, construction, finance, and legal services. Multiple US government agencies, and industrial entities were affected. Researchers established the hackers spent about a month exploring the breached networks, using rclone to archive stolen data to cloud storage, before encrypting the target system. Reported targets included LaSalle County, and the city of Novi Sad. The FBI warned January 2021, in Private Industry Notification 20210106–001, of extortion activity using Egregor ransomware and rclone. Organisations worldwide had been threatened with public release of exfiltrated data. In some cases rclone had been disguised under the name svchost. Bookseller Barnes & Noble, US retailer Kmart, games developer Ubisoft and the Vancouver metro system have been reported as victims. An April 2021, cybersecurity investigation into SonicWall VPN zero-day vulnerability SNWLID-2021-0001 by FireEye's Mandiant team established attackers UNC2447 used rclone for reconnaissance and exfiltration of victims' files. Cybersecurity and Infrastructure Security Agency Analysis Report AR21-126A confirmed this use of rclone in FiveHands ransomware attacks. A June 2021, Microsoft Security Intelligence Twitter post identified use of rclone in BazaCall cyber attacks. The attackers sent emails e

Mountain car problem

Mountain Car, a standard testing domain in Reinforcement learning, is a problem in which an under-powered car must drive up a steep hill. Since gravity is stronger than the car's engine, even at full throttle, the car cannot simply accelerate up the steep slope. The car is situated in a valley and must learn to leverage potential energy by driving up the opposite hill before the car is able to make it to the goal at the top of the rightmost hill. The domain has been used as a test bed in various reinforcement learning papers. == Introduction == The mountain car problem, although fairly simple, is commonly applied because it requires a reinforcement learning agent to learn on two continuous variables: position and velocity. For any given state (position and velocity) of the car, the agent is given the possibility of driving left, driving right, or not using the engine at all. In the standard version of the problem, the agent receives a negative reward at every time step when the goal is not reached; the agent has no information about the goal until an initial success. == History == The mountain car problem appeared first in Andrew Moore's PhD thesis (1990). It was later more strictly defined in Singh and Sutton's reinforcement learning paper with eligibility traces. The problem became more widely studied when Sutton and Barto added it to their book Reinforcement Learning: An Introduction (1998). Throughout the years many versions of the problem have been used, such as those which modify the reward function, termination condition, and the start state. == Techniques used to solve mountain car == Q-learning and similar techniques for mapping discrete states to discrete actions need to be extended to be able to deal with the continuous state space of the problem. Approaches often fall into one of two categories, state space discretization or function approximation. 
=== Discretization === In this approach, two continuous state variables are pushed into discrete states by bucketing each continuous variable into multiple discrete states. This approach works with properly tuned parameters but a disadvantage is information gathered from one state is not used to evaluate another state. Tile coding can be used to improve discretization and involves continuous variables mapping into sets of buckets offset from one another. Each step of training has a wider impact on the value function approximation because when the offset grids are summed, the information is diffused. === Function approximation === Function approximation is another way to solve the mountain car. By choosing a set of basis functions beforehand, or by generating them as the car drives, the agent can approximate the value function at each state. Unlike the step-wise version of the value function created with discretization, function approximation can more cleanly estimate the true smooth function of the mountain car domain. === Eligibility traces === One aspect of the problem involves the delay of actual reward. The agent is not able to learn about the goal until a successful completion. Given a naive approach for each trial the car can only backup the reward of the goal slightly. This is a problem for naive discretization because each discrete state will only be backed up once, taking a larger number of episodes to learn the problem. This problem can be alleviated via the mechanism of eligibility traces, which will automatically backup the reward given to states before, dramatically increasing the speed of learning. Eligibility traces can be viewed as a bridge from temporal difference learning methods to Monte Carlo methods. == Technical details == The mountain car problem has undergone many iterations. This section focuses on the standard well-defined version from Sutton (2008). === State variables === Two-dimensional continuous state space. 
$Velocity = (-0.07, 0.07)$

$Position = (-1.2, 0.6)$

=== Actions ===
One-dimensional discrete action space.

$motor = (left, neutral, right)$

=== Reward ===
For every time step:

$reward = -1$

=== Update function ===
For every time step:

$Action = [-1, 0, 1]$

$Velocity = Velocity + (Action) \cdot 0.001 + \cos(3 \cdot Position) \cdot (-0.0025)$

$Position = Position + Velocity$

=== Starting condition ===
Optionally, many implementations include randomness in both parameters to show better generalized learning.

$Position = -0.5$

$Velocity = 0.0$

=== Termination condition ===
End the simulation when:

$Position \geq 0.6$

== Variations ==
There are many versions of the mountain car which deviate in different ways from the standard model. Variables that vary include but are not limited to changing the constants (gravity and steepness) of the problem so specific tuning for specific policies become irrelevant and altering the reward function to affect the agent's ability to learn in a different manner. An example is changing the reward to be equal to the distance from the goal, or changing the reward to zero everywhere and one at the goal. Additionally, a 3D mountain car can be used, with a 4D continuous state space.
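The update function, starting condition and termination condition listed under Technical details can be sketched as a small simulator. This is an illustration rather than a reference implementation: clamping velocity and position to their stated ranges and stopping the car dead at the left boundary are assumptions the text does not specify, and the two policies are hypothetical hand-written controllers, not learned ones.

```python
import math

MIN_POS, MAX_POS = -1.2, 0.6   # position range from the State variables section
MAX_SPEED = 0.07               # velocity range is (-0.07, 0.07)
GOAL = 0.6                     # termination: Position >= 0.6


def step(position, velocity, action):
    """Apply one update; action is -1 (left), 0 (neutral) or 1 (right)."""
    velocity += action * 0.001 + math.cos(3 * position) * (-0.0025)
    velocity = max(-MAX_SPEED, min(MAX_SPEED, velocity))  # assumption: clamp speed
    position += velocity
    if position <= MIN_POS:
        # Assumption: the left boundary is an inelastic wall.
        position, velocity = MIN_POS, 0.0
    position = min(position, MAX_POS)
    return position, velocity


def run(policy, max_steps=1000):
    """Return the number of steps needed to reach the goal, or None."""
    position, velocity = -0.5, 0.0  # standard starting condition
    for t in range(1, max_steps + 1):
        position, velocity = step(position, velocity, policy(position, velocity))
        if position >= GOAL:
            return t  # total reward is -t, since every step costs -1
    return None


def always_right(position, velocity):
    return 1


def follow_momentum(position, velocity):
    # Push in the direction the car is already moving to build up energy.
    return 1 if velocity >= 0 else -1


print("always right:", run(always_right))
print("follow momentum:", run(follow_momentum))
```

Full throttle to the right never reaches the goal because the engine is weaker than gravity on the slope, whereas swinging back and forth accumulates enough energy to get over the hill.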

Graphics processing unit

A graphics processing unit (GPU) is a specialized electronic circuit designed for digital image processing and to accelerate computer graphics, being present either as a component on a discrete graphics card or embedded on motherboards, mobile phones, personal computers, workstations, and game consoles. GPUs are increasingly being used for artificial intelligence (AI) processing due to linear algebra acceleration, which is also used extensively in graphics processing. Although there is no single definition of the term, and it may be used to describe any video display system, in modern use a GPU includes the ability to internally perform the calculations needed for various graphics tasks, like rotating and scaling 3D images, and often the additional ability to run custom programs known as shaders. This contrasts with earlier graphics controllers known as video display controllers which had no internal calculation capabilities, or blitters, which performed only basic memory movement operations. The modern GPU emerged during the 1990s, adding the ability to perform operations like drawing lines and text without CPU help, and later adding 3D functionality. Graphics functions are generally independent and this lends these tasks to being implemented on separate calculation engines. Modern GPUs include hundreds, or thousands, of calculation units. This made them useful for non-graphic calculations involving embarrassingly parallel problems due to their parallel structure. The ability of GPUs to rapidly perform vast numbers of calculations has led to their adoption in diverse fields including artificial intelligence (AI) where they excel at handling data-intensive and computationally demanding tasks. Other non-graphical uses include the training of neural networks and cryptocurrency mining. == History == === 1960s === Dedicated 3D graphics hardware dates back to graphic terminals such as the Adage AGT-30 from 1967 with analog matrix processors. 
In 1969 Evans & Sutherland (E&S) introduced the Line Drawing System-1 (LDS-1), which was the first all-digital system to provide matrix multiplication. Also in 1969, the low-cost graphics terminal IMLAC PDS-1 was introduced. It later saw use as an early 3D gaming machine with the likes of Maze War. === 1970s === In professional hardware, in 1972 PLATO IV system becomes operational at the University of Illinois Urbana-Champaign. Between around 1973 and 1978, several networked multiplayer wireframe 3D games are implemented and popularized by users of the system. Also in 1972, the E&S Continuous Tone 1 (CT1) "Watkins box" system (consisting of an E&S LDS-2 and Shaded Picture System) is delivered to Case Western Reserve University. It offered the first real-time Gouraud shading. In 1975, a joint effort between Evans & Sutherland Computer Corporation and the University of Utah's computer graphics department results in the first ever MOSFET video framebuffer, capable of color and smooth shading. E&S Continuous Tone 3 (CT3) system was delivered in 1977 to Lufthansa for pilot training using computer simulation. It was the first graphics system capable of real-time texture mapping. Ikonas made graphics systems with 8- and 24-bit graphics and 3D acceleration in the late 70s. Arcade system boards have used specialized 2D graphics circuits since the 1970s. In early video game hardware, RAM for frame buffers was expensive, so video chips composited data together as the display was being scanned out on the monitor. A specialized barrel shifter circuit helped the CPU animate the framebuffer graphics for various 1970s arcade video games from Midway and Taito, such as Gun Fight (1975), Sea Wolf (1976), and Space Invaders (1978). The Namco Galaxian arcade system in 1979 used specialized graphics hardware that supported RGB color, multi-colored sprites, and tilemap backgrounds. 
The Galaxian hardware was widely used during the golden age of arcade video games, by game companies such as Namco, Centuri, Gremlin, Irem, Konami, Midway, Nichibutsu, Sega, and Taito. The Atari 2600 in 1977 used a video shifter called the Television Interface Adaptor. Atari 8-bit computers (1979) had ANTIC, a video processor which interpreted instructions describing a "display list"—the way the scan lines map to specific bitmapped or character modes and where the memory is stored (so there did not need to be a contiguous frame buffer). 6502 machine code subroutines could be triggered on scan lines by setting a bit on a display list instruction. ANTIC also supported smooth vertical and horizontal scrolling independent of the CPU. === 1980s === In the 1980s significant advancements were made in professional 3D graphics hardware. Perhaps most impactful was the 1981 development of the Geometry Engine, a VLSI vector processor ASIC designed by Jim Clark and Marc Hannah at Stanford University. This processor is the forerunner of modern tensor cores and other similar processors marketed for graphics and AI. The Geometry Engine went on to be used in Silicon Graphics workstations for many years. Silicon Graphics's first product, shipped in November 1983, was the IRIS 1000, a terminal with hardware-accelerated 3D graphics based on the Geometry Engine. The Geometry Engine was capable of approximately 6 million operations per second. The 1981 NEC μPD7220 was the first implementation of a personal computer graphics display processor as a single large-scale integration (LSI) integrated circuit chip. This enabled the design of low-cost, high-performance video graphics cards such as those from Number Nine Visual Technology. It became the best-known GPU until the mid-1980s. 
It was the first fully integrated VLSI (very large-scale integration) metal–oxide–semiconductor (NMOS) graphics display processor for PCs, supported up to 1024×1024 resolution, and laid the foundations for the PC graphics market. It was used in a number of graphics cards and was licensed for clones such as the Intel 82720, the first of Intel's graphics processing units. The Williams Electronics arcade games Robotron: 2084, Joust, Sinistar, and Bubbles, all released in 1982, contain custom blitter chips for operating on 16-color bitmaps. In 1984, Hitachi released the ARTC HD63484, the first major CMOS graphics processor for personal computers. The ARTC could display up to 4K resolution when in monochrome mode. It was used in a number of graphics cards and terminals during the late 1980s. In 1985, the Amiga was released with a custom graphics chip called Agnus including a blitter for bitmap manipulation, line drawing, and area fill. It also included a coprocessor with its own simple instruction set, that was capable of manipulating graphics hardware registers in sync with the video beam (e.g. for per-scanline palette switches, sprite multiplexing, and hardware windowing), or driving the blitter. Also in 1985, IBM released the Professional Graphics Controller, designed by later to be Nvidia co-founder Curtis Priem, which was a rudimentary 3D card with 640 × 480 256-color graphics which used a dedicated CPU to draw graphics independently of the main system. It was used as the basis of cards by a number of makers (including Matrox) and its analog RGB signaling led directly to the VGA video standard. Priem later in the 80s worked on the influential Sun Microsystems GX (also known as cgsix) accelerated 2D graphics card. In 1986, Texas Instruments released the TMS34010, the first fully programmable graphics processor. It could run general-purpose code but also had a graphics-oriented instruction set. 
During 1990–1992, this chip became the basis of the Texas Instruments Graphics Architecture ("TIGA") Windows accelerator cards. Following in 1987, the IBM 8514 graphics system was released. It was one of the first video cards for IBM PC compatibles that implemented fixed-function 2D primitives in electronic hardware. Sharp's X68000, released in 1987, used a custom graphics chipset with a 65,536 color palette and hardware support for sprites, scrolling, and multiple playfields. It served as a development machine for Capcom's CP System arcade board. Fujitsu's FM Towns computer, released in 1989, had support for a 16,777,216 color palette. For context, IBM also introduced its Video Graphics Array (VGA) display system in 1987, with a maximum resolution of 640 × 480 pixels. Unlike 8514/A, VGA had no hardware acceleration features. In November 1988, NEC Home Electronics announced its creation of the Video Electronics Standards Association (VESA) to develop and promote a Super VGA (SVGA) computer display standard as a successor to VGA. Super VGA enabled graphics display resolutions up to 800 × 600 pixels, a 56% increase. In 1988 SGI sold IRIS workstation graphics with 10-12 Geometry Engines and introduced the IrisVision add-in board for IBM MicroChannel bus (RS/6000) based on the Geometry Engine as well. In 1988 as well, the first dedicated polygonal 3D graphics boards in arcade machines were introduced wit

Socially assistive robot

A socially assistive robot (SAR) aids users through social engagement and support rather than through physical tasks and interactions. == Background == The field of socially assistive robotics emerged in the early 2000s, following the emergence of the field of social robots. In contrast to social robots, SARs aid users with specific goals related to behavior change rather than serving as purely social entities. The term "Socially assistive robot" was initially defined by Maja Matarić and David Feil-Seifer in 2005. Since its inception, the field has gained substantial recognition, featuring numerous research projects, a wealth of global research publications, startup companies, and a growing array of products on the consumer market. The COVID-19 pandemic has underscored the immense potential of socially assistive robots, particularly in addressing the needs of large user populations, including children engaged in remote learning, elderly individuals grappling with loneliness, and those affected by social isolation and its associated negative consequences. == Characteristics of interaction == SARs rely on artificial intelligence (AI) to generate real-time, responsive, natural, and meaningful robot behaviors during interactions with humans. The robots employ various forms of communication, such as facial expressions, gestures, body movements, and speech. In contrast to robots intended for physical tasks, SARs are designed to support and motivate users to perform their own tasks. The tasks a user engages in can be physical (e.g., rehabilitation exercises for post-stroke users), cognitive (e.g., dementia screening for elderly users), or social (e.g., turn-taking for users with autism spectrum disorders). This complex interaction involves detecting and interpreting the user's movement, behavior, intent, goals, speech, and preferences. 
Machine learning and robot learning techniques are frequently employed to enhance the robot's understanding of the user, predict user preferences, and provide effective assistance. The effectiveness of socially assistive robots is assessed based on objective measurements of user performance and improvement resulting from the robot's assistance and support. Unlike other branches of robotics, where effectiveness depends on the robot's physical task completion, SAR measures the success of the robot based on the user's progress and achievements. This evaluation is carried out using quantitative objective metrics, such as time spent on tasks, accuracy, retention, and verbalization, as well as quantitative subjective metrics, such as user survey tools. SAR is based on the large body of evidence showing that users tend to respond more positively to interactions with physical robots compared to interactions with screens. Interaction with physical robots also encourages users to learn and retain more information than screen-based interactions. This fundamental insight underlines why physical robots in SAR applications are more effective than interactions solely involving screens, tablets, or computers. == Uses and applications == SARs have been developed and validated in a wide array of applications, including healthcare, elder care, education, and training. For example, SARs have been developed to support children on the autism spectrum in acquiring and practicing social and cognitive skills, to motivate and coach stroke patients throughout their rehabilitation exercises, to monitor individuals' health (e.g., fall detection), and to encourage elderly users to be more physically and socially active. There is a concern that technophobia and lack of trust in robots will pose a barrier to the effectiveness of SARs in older adults.

Plum Voice

The Plum Group, Inc. (DBA Plum Voice) is a company. Plum is headquartered in New York City with offices in Boston and Denver. == History == Plum Voice, founded in 2000 as The Plum Group, Inc., was incorporated to create technologies for personalized audio communication. By 2001, Plum had commercialized the open-standard Plum VoiceXML IVR platform which facilitated the creation of dynamic telecom applications.
* 2001: Commercial launch of Plum VoiceXML IVR platform for customer-premises deployment.
* 2002: Launch of Plum Voice Hosting Centers for 24x7x365 managed IVR hosting.
* 2004: Plum Voice application suite received a "Product of the Year" award from Customer Interactions magazine.
* 2008: Plum Survey builder launched, a do-it-yourself IVR survey tool.
* 2010: Plum launched QuickFuse, a web-based rapid development platform used to create voice applications.
* 2013: Plum launched VoiceTrends, an analytics and reporting toolkit designed specifically for voice applications. Plum achieved PCI-DSS Level 1.
* 2015: Plum launched Plum Insight, a multi-channel (voice, web, mobile) survey platform. Plum achieved HIPAA compliance.
* 2016: Plum launched a new version of QuickFuse called Fuse+.
* 2020: Plum sunset QuickFuse and rebranded Fuse+ as Plum Fuse.

Leakage (machine learning)

In statistics and machine learning, leakage (also known as data leakage or target leakage) refers to the use of information during model training that would not be available at prediction time. This results in overly optimistic performance estimates, as the model appears to perform better during evaluation than it actually would in a production environment. Leakage is often subtle and indirect, making it difficult to detect and eliminate. It can lead a statistician or modeler to select a suboptimal model, which may be outperformed by a leakage-free alternative. == Leakage modes == Leakage can occur at multiple stages of the machine learning workflow. Broadly, its sources can be divided into two categories: those arising from features and those arising from training examples. === Feature leakage === Feature or column-wise leakage is caused by the inclusion of columns which are one of the following: a duplicate label, a proxy for the label, or the label itself. These features, known as anachronisms, will not be available when the model is used for predictions, and result in leakage if included when the model is trained. For example, including a "MonthlySalary" column when predicting "YearlySalary"; or "MinutesLate" when predicting "IsLate". === Training example leakage === Row-wise leakage is caused by improper sharing of information between rows of data. 
Types of row-wise leakage include:
* Premature featurization: fitting a feature transform before the cross-validation or train/test split (MinMax scaling, n-grams and similar transforms must be fitted on only the train split, then applied to the test set)
* Duplicate rows between train/validation/test (for example, oversampling a dataset to pad its size before splitting; different rotations/augmentations of a single image; bootstrap sampling before splitting; or duplicating rows to upsample the minority class)
* Non-independent and identically distributed (non-IID) data
* Time leakage (for example, splitting a time-series dataset randomly instead of placing newer data in the test set using a train/test split or rolling-origin cross-validation)
* Group leakage: not including a grouping split column (for example, Andrew Ng's group had 100k x-rays of 30k patients, meaning ~3 images per patient. The paper used random splitting instead of ensuring that all images of a patient were in the same split. Hence the model partially memorized the patients instead of learning to recognize pneumonia in chest x-rays.)

A 2023 review found data leakage to be "a widespread failure mode in machine-learning (ML)-based science", having affected at least 294 academic publications across 17 disciplines, and causing a potential reproducibility crisis.

== Detection ==
Data leakage in machine learning can be detected through various methods, focusing on performance analysis, feature examination, data auditing, and model behavior analysis. Performance-wise, unusually high accuracy or significant discrepancies between training and test results often indicate leakage. Inconsistent cross-validation outcomes may also signal issues. Feature examination involves scrutinizing feature importance rankings and ensuring temporal integrity in time series data. A thorough audit of the data pipeline is crucial, reviewing pre-processing steps, feature engineering, and data splitting processes. 
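Two of the checks such an audit covers, fitting pre-processing on the train split only and looking for rows shared between splits, can be sketched in plain Python. The function names and the toy numbers are hypothetical and chosen only for illustration; real pipelines would normally use a library scaler, but the ordering of the steps is the point here.

```python
def fit_min_max(values):
    """Learn min-max scaling parameters from the given values."""
    return min(values), max(values)


def transform(values, params):
    """Scale values with previously fitted parameters."""
    lo, hi = params
    span = (hi - lo) or 1.0  # avoid dividing by zero for a constant feature
    return [(v - lo) / span for v in values]


def shared_rows(train_rows, test_rows):
    """Return rows that appear in both splits (exact duplicates only)."""
    return set(train_rows) & set(test_rows)


train = [1.0, 2.0, 3.0, 5.0]
test = [4.0, 9.0]

# Correct order: fit on the train split only, then apply to both splits.
params = fit_min_max(train)
train_scaled = transform(train, params)
test_scaled = transform(test, params)   # test values may fall outside [0, 1]

# The mistake to look for: test data influencing the fitted parameters.
leaky_params = fit_min_max(train + test)

print(params, leaky_params)
print(test_scaled)
print(shared_rows([(1, "a"), (2, "b")], [(2, "b"), (3, "c")]))
```

When the parameters fitted on the combined data differ from those fitted on the train split alone, as they do here, the test set has leaked into pre-processing.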
Detecting duplicate entries across dataset splits is also important. For language models, the Min-K% method can detect the presence of data in a pretraining dataset. It takes a sentence suspected to be present in the pretraining dataset, computes the log-likelihood of each token, then computes the average of the lowest K% of these. If this exceeds a threshold, then the sentence is likely present. This method is improved by comparing against a baseline of the mean and variance. Analyzing model behavior can reveal leakage. Models relying heavily on counter-intuitive features or showing unexpected prediction patterns warrant investigation. Performance degradation over time when tested on new data may suggest earlier inflated metrics due to leakage. Advanced techniques include backward feature elimination, where suspicious features are temporarily removed to observe performance changes. Using a separate hold-out dataset for final validation before deployment is advisable.
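The Min-K% score described above can be sketched as follows, assuming the per-token log-likelihoods have already been obtained from a language model. The log-likelihood values, the threshold and the function names are made up for illustration, and the mean and variance baseline mentioned in the text is not included.

```python
def min_k_score(token_log_probs, k_percent=20):
    """Average of the lowest k_percent of per-token log-likelihoods."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    # Number of tokens to keep; always use at least one.
    n = max(1, len(token_log_probs) * k_percent // 100)
    lowest = sorted(token_log_probs)[:n]
    return sum(lowest) / n


def likely_seen(token_log_probs, threshold, k_percent=20):
    """Flag a sentence as likely present when its score exceeds the threshold."""
    return min_k_score(token_log_probs, k_percent) > threshold


# Hypothetical log-likelihoods: no token is surprising in the first
# sentence, while two tokens are very unlikely in the second.
familiar = [-0.1, -0.2, -0.3, -0.4, -0.5]
unfamiliar = [-0.1, -6.0, -0.2, -7.0, -0.3]

print(min_k_score(familiar, k_percent=40))
print(min_k_score(unfamiliar, k_percent=40))
print(likely_seen(familiar, threshold=-1.0, k_percent=40))
print(likely_seen(unfamiliar, threshold=-1.0, k_percent=40))
```

Averaging only the least likely tokens makes the score sensitive to a few surprising words, which unseen text tends to contain even when most of its tokens are common.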