Change data capture

Change data capture

In databases, change data capture (CDC) is a set of software design patterns used to determine and track the data that has changed (the "deltas") so that action can be taken using the changed data. The result is a delta-driven dataset. CDC is an approach to data integration that is based on the identification, capture and delivery of the changes made to enterprise data sources. For instance it can be used for incremental update of data loading. CDC occurs often in data warehouse environments since capturing and preserving the state of data across time is one of the core functions of a data warehouse, but CDC can be utilized in any database or data repository system. == Methodology == System developers can set up CDC mechanisms in a number of ways and in any one or a combination of system layers from application logic down to physical storage. In a simplified CDC context, one computer system has data believed to have changed from a previous point in time, and a second computer system needs to take action based on that changed data. The former is the source, the latter is the target. It is possible that the source and target are the same system physically, but that would not change the design pattern logically. Multiple CDC solutions can exist in a single system. === Timestamps on rows === Tables whose changes must be captured may have a column that represents the time of last change. Names such as LAST_UPDATE, LAST_MODIFIED, etc. are common. Any row in any table that has a timestamp in that column that is more recent than the last time data was captured is considered to have changed. Timestamps on rows are also frequently used for optimistic locking so this column is often available. === Version numbers on rows === Database designers give tables whose changes must be captured a column that contains a version number. Names such as VERSION_NUMBER, etc. are common. One technique is to mark each changed row with a version number. A current version is maintained for the table, or possibly a group of tables. This is stored in a supporting construct such as a reference table. When a change capture occurs, all data with the latest version number is considered to have changed. Once the change capture is complete, the reference table is updated with a new version number. (Do not confuse this technique with row-level versioning used for optimistic locking. For optimistic locking each row has an independent version number, typically a sequential counter. This allows a process to atomically update a row and increment its counter only if another process has not incremented the counter. But CDC cannot use row-level versions to find all changes unless it knows the original "starting" version of every row. This is impractical to maintain.) === Status indicators on rows === This technique can either supplement or complement timestamps and versioning. It can configure an alternative if, for example, a status column is set up on a table row indicating that the row has changed (e.g., a boolean column that, when set to true, indicates that the row has changed). Otherwise, it can act as a complement to the previous methods, indicating that a row, despite having a new version number or a later date, still shouldn't be updated on the target (for example, the data may require human validation). === Time/version/status on rows === This approach combines the three previously discussed methods. As noted, it is not uncommon to see multiple CDC solutions at work in a single system, however, the combination of time, version, and status provides a particularly powerful mechanism and programmers should utilize them as a trio where possible. The three elements are not redundant or superfluous. Using them together allows for such logic as, "Capture all data for version 2.1 that changed between 2005-06-01 00:00 and 2005-07-01 00:00 where the status code indicates it is ready for production." === Triggers on tables === May include a publish/subscribe pattern to communicate the changed data to multiple targets. In this approach, triggers log events that happen to the transactional table into another queue table that can later be "played back". For example, imagine an Accounts table, when transactions are taken against this table, triggers would fire that would then store a history of the event or even the deltas into a separate queue table. The queue table might have schema with the following fields: Id, TableName, RowId, Timestamp, Operation. The data inserted for our Account sample might be: 1, Accounts, 76, 2008-11-02 00:15, Update. More complicated designs might log the actual data that changed. This queue table could then be "played back" to replicate the data from the source system to a target. Data capture offers a challenge in that the structure, contents and use of a transaction log is specific to a database management system. Unlike data access, no standard exists for transaction logs. Most database management systems do not document the internal format of their transaction logs, although some provide programmatic interfaces to their transaction logs (for example: Oracle, DB2, SQL/MP, SQL/MX and SQL Server 2008). Other challenges in using transaction logs for change data capture include: Coordinating the reading of the transaction logs and the archiving of log files (database management software typically archives log files off-line on a regular basis). Translation between physical storage formats that are recorded in the transaction logs and the logical formats typically expected by database users (e.g., some transaction logs save only minimal buffer differences that are not directly useful for change consumers). Dealing with changes to the format of the transaction logs between versions of the database management system. Eliminating uncommitted changes that the database wrote to the transaction log and later rolled back. Dealing with changes to the metadata of tables in the database. CDC solutions based on transaction log files have distinct advantages that include: minimal impact on the database (even more so if one uses log shipping to process the logs on a dedicated host). no need for programmatic changes to the applications that use the database. low latency in acquiring changes. transactional integrity: log scanning can produce a change stream that replays the original transactions in the order they were committed. Such a change stream include changes made to all tables participating in the captured transaction. no need to change the database schema == Confounding factors == As often occurs in complex domains, the final solution to a CDC problem may have to balance many competing concerns. === Unsuitable source systems === Change data capture both increases in complexity and reduces in value if the source system saves metadata changes when the data itself is not modified. For example, some Data models track the user who last looked at but did not change the data in the same structure as the data. This results in noise in the Change Data Capture. === Tracking the capture === Actually tracking the changes depends on the data source. If the data is being persisted in a modern database then Change Data Capture is a simple matter of permissions. Two techniques are in common use: Tracking changes using database triggers Reading the transaction log as, or shortly after, it is written. If the data is not in a modern database, CDC becomes a programming challenge. === Push versus pull === Push: the source process creates a snapshot of changes within its own process and delivers rows downstream. The downstream process uses the snapshot, creates its own subset and delivers them to the next process. Pull: the target that is immediately downstream from the source, prepares a request for data from the source. The downstream target delivers the snapshot to the next target, as in the push model. === Alternatives === Sometimes the slowly changing dimension is used as an alternative method. CDC and SCD are similar in that both methods can detect changes in a data set. The most common forms of SCD are type 1 (overwrite), type 2 (maintain history) or 3 (only previous and current value). SCD 2 can be useful if history is needed in the target system. CDC overwrites in the target system (akin to SCD1), and is ideal when only the changed data needs to arrive at the target, i.e. a delta-driven dataset.

Anaconda (Python distribution)

Anaconda is an open source data science and artificial intelligence distribution platform for the Python programming language. Developed by Anaconda, Inc., an American company founded in 2012, the platform is used to develop and manage data science and AI projects. In 2024, Anaconda Inc. has about 300 employees and 45 million users. == History == Co-founded in Austin, Texas in 2012 as Continuum Analytics by Peter Wang and Travis Oliphant, Anaconda Inc. operates from the United States and Europe. Anaconda Inc. developed Conda, a cross-platform, language-agnostic binary package manager. It also launched PyData community workshops and the Jupyter Cloud Notebook service (Wakari.io). In 2013, it received funding from DARPA. In 2015, the company had two million users including 200 of the Fortune 500 companies and raised $24 million in a Series A funding round led by General Catalyst and BuildGroup. Anaconda secured an additional $30 million in funding in 2021. Continuum Analytics rebranded as Anaconda in 2017. That year, it announced the release of Anaconda Enterprise 5, an integration with Microsoft Azure, and had over 13 million users by year's end. In 2022, it released Anaconda Business; new integrations with Snowflake and others; and the open-source PyScript. It also acquired PythonAnywhere, while Anaconda's user base exceeded 30 million in 2022. In 2023, Anaconda released Python in Excel, a new integration with Microsoft Excel, and launched PyScript.com. The company made a series of investments in AI during 2024. That February, Anaconda partnered with IBM to import its repository of Python packages into Watsonx, IBM's generative AI platform. The same year, Anaconda joined IBM's AI Alliance and released an integration with Teradata and Lenovo. In 2024, Anaconda's user base reached 45 million users and Barry Libert was named company CEO, after serving on Anaconda's board of directors. He was succeeded as CEO in October 2025 by David DeSanto, who also became a company director. In May 2025, the company introduced the first unified AI platform for Open Source, Anaconda AI Platform, a central control for AI workflows that enables customization in Python-based enterprise AI development. That July, after reaching over $150 million in a Series C funding round, Anaconda was evaluated at about $1.5 billion. == Overview == Anaconda distribution comes with over 300 packages automatically installed, and over 7,500 additional open-source packages can be installed from the Anaconda repository as well as the Conda package and virtual environment manager. It also includes a GUI, Anaconda Navigator, as a graphical alternative to the command-line interface (CLI). Conda was developed to address dependency conflicts native to the pip package manager, which would automatically install any dependent Python packages without checking for conflicts with previously installed packages (until its version 20.3, which later implemented consistent dependency resolution). The Conda package manager's historical differentiation analyzed and resolved these installation conflicts. Anaconda is a distribution of the Python programming language (and previously also R) for scientific computing (data science, machine learning applications, large-scale data processing, predictive analytics, etc.), that aims to simplify package management and deployment. Anaconda distribution includes data-science packages suitable for Windows, Linux, and macOS. Other company products include Anaconda Free, and subscription-based Starter, Business and Enterprise. Anaconda's business tier offers Package Security Manager. Package versions in Anaconda are managed by the package management system Conda, which was spun out as a separate open-source package as useful both independently and for applications other than Python. There is also a small, bootstrap version of Anaconda called Miniconda, which includes only Conda, Python, the packages they depend on, and a small number of other packages. Open source packages can be individually installed from the Anaconda repository, Anaconda Cloud (anaconda.org), or the user's own private repository or mirror, using the conda install command. Anaconda, Inc. compiles and builds the packages available in the Anaconda repository itself, and provides binaries for Windows 32/64 bit, Linux 64 bit and MacOS 64-bit (Intel, Apple Silicon). Anything available on PyPI may be installed into a Conda environment using pip, and Conda will keep track of what it has installed and what pip has installed. Custom packages can be made using the conda build command, and can be shared with others by uploading them to Anaconda Cloud, PyPI or other repositories. The default installation of Anaconda2 includes Python 2.7 and Anaconda3 includes Python 3.7. However, it is possible to create new environments that include any version of Python packaged with Conda. === Anaconda Navigator === Anaconda Navigator is a desktop graphical user interface (GUI) included in Anaconda distribution that allows users to launch applications and manage Conda packages, environments and channels without using command-line commands. Navigator can search for packages on Anaconda Cloud or in a local Anaconda Repository, install them in an environment, run the packages and update them. It is available for Windows, macOS and Linux. The following applications are available by default in Navigator: JupyterLab Jupyter Notebook QtConsole Spyder Glue Orange RStudio Visual Studio Code === Conda === Conda is an open source, cross-platform, language-agnostic package manager and environment management system that installs, runs, and updates packages and their dependencies. It was created for Python programs, but it can package and distribute software for any language, including multi-language projects. The Conda package and environment manager is included in all versions of Anaconda, Miniconda, and Anaconda Repository. == Anaconda.org == Anaconda Cloud is a package management service by Anaconda where users can find, access, store and share public and private notebooks, environments, and Conda and PyPI packages. Cloud hosts useful Python packages, notebooks and environments for a wide variety of applications. Users do not need to log in or to have a Cloud account, to search for public packages, download and install them. Users can build new Conda packages using Conda-build and then use the Anaconda Client CLI to upload packages to Anaconda.org. Notebooks users can be aided with writing and debugging code with Anaconda's AI Assistant.

Josh (app)

Josh (stylized as JOSH) was a video-sharing social networking service but it has since evolved into a live call and chat application owned by VerSe Innovation – an Indian technology company based in Bangalore, India. Josh was an Indian short video app that was launched in immediately after the Indian Government banned TikTok and other Chinese apps in June 2020. The founders of the platform have promoted the app as the “Instagram for Bharat” referring to their focus on the Indian audience that speaks its own regional and state languages. Josh was among the top 10 most downloaded apps social and entertainment apps in India of 2021 and had 150 million monthly active users as per April 2022. The word 'Josh' translates to fervour or passion. The app was launched under the aegis of the Atmanirbhar Bharat campaign and to compete with the duopoly of Google and Facebook in India. Josh's parent company VerSe Innovations Pvt. Ltd. owns another startup Dailyhunt, which a content and news aggregator application. Both Dailyhunt and Josh are a part of the VerSe's focus on the "next billion" regional language users of India. Founders Virendra Gupta and Umang Bedi conceptualised Josh as a short-video platform that made content creation accessible to vernacular language users, essentially the non-English speaking audience in India. == Features == Josh is currently available in 12 Indian languages and allows users to upload, share, remix bite-sized videos of up to 120 seconds. There are various categories across the video section including viral, trending, glamour, dance, devotion, yoga and cooking among others. Similar to Instagram and TikTok, it has a video feed which is curated for individuals on the basis of their app behaviour. The app hosts many daily, weekly and monthly social media challenges. == Funding == In December 2020, within 3 months of its launch, Josh's parent app VerSe Innovation raised more than $100 million from investors including Alphabet Inc's Google and Microsoft. In February 2021, VerSe Innovation raised $100 million in Series H funding from Qatar Investment Authority, the sovereign wealth fund of the State of Qatar, and Glade Brook Capital Partners. In August 2021, VerSe raised over $450 million in its Series I financing round with a valuation of $1 billion. Investors included Canada Pension Plan Investment Board (CPPIB), Siguler Guff, Baillie Gifford, Carlyle Asia Partners Growth II affiliates, and others. The startup announced its plan to expand overseas and broaden its ecommerce play for both Dailyhunt and Josh. In April 2022, VerSe announced that it has raised $805 million in funding from investors at a valuation of nearly $5 billion. ByteDance Offloads Stake In Josh Parent VerSe, Exits At 56% Discount == Partnerships == In February 2021, Saregama and Josh signed a music licensing deal, wherein Josh expanded its musical library with 1.3 lakh songs from Saregama in 25 different languages. To improve their user experience, Josh partnered with computer vision company D-ID in August 2021. The company helped Josh introduce photo-to-video features, live portrait technology, animate their photos etc. In order to solidify their efforts in enhancing Josh, VerSe acquired Indian social networking platform GolBol in October 2021. The move came as an effort by the startup to strengthen their discovery initiatives on the platform and classify content at scale and understand the core behaviour of Indian regional audiences. Josh has also announced its plans to include live commerce as a potential revenue stream through its partnership with multiple large e-commerce players. == Notable campaigns == Say No To Dowry – In association with Josh, the Kerala Police partook in the #SayNo2Dowry online social media campaign that was started to highlight and stop the social evil in the state. Salute India – Josh entered the Guinness World Records by creating the largest online video album of people saluting (29,529). It organised an online campaign #SaluteIndia on the app during the 75th Independence Day of India during 10–15 August 2021.

North Atlantic Population Project

The North Atlantic Population Project (NAPP) is a collaboration of historical demographers in Britain, Canada, Denmark, Germany, Iceland, Norway, and Sweden to produce a massive census microdata collection for the North Atlantic Region in the late-nineteenth century. The database includes complete individual-level census enumerations for each country, and provides information on over 110 million people. This large scale allows detailed analysis of small geographic areas and population subgroups. The NAPP database is designed to be compatible with the Integrated Public Use Microdata Series (IPUMS), and is disseminated through the IPUMS data-access system at the Minnesota Population Center, University of Minnesota. Major collaborators on the project include Lisa Dillon, University of Montreal; Chad Gaffield, University of Ottawa; Ólöf Garðarsdóttir, Statistics Iceland; Marianne Jarnes Erikstad, University of Tromsø; Jan Oldervall University of Bergen; Evan Roberts, University of Minnesota; Steven Ruggles, University of Minnesota; Kevin Schürer, UK Data Archive; Gunnar Thorvaldsen, University of Tromsø; and Matthew Woollard, UK Data Archive. The project is also coordinated by the Minnesota Population Center at the University of Minnesota.

Texture filtering

In computer graphics, texture filtering or texture smoothing is the method used to determine the texture color for a texture mapped pixel, using the colors of nearby texels (ie. pixels of the texture). Filtering describes how a texture is applied at many different shapes, size, angles and scales. Depending on the chosen filter algorithm, the result will show varying degrees of blurriness, detail, spatial aliasing, temporal aliasing and blocking. Depending on the circumstances, filtering can be performed in software (such as a software rendering package) or in hardware, eg. with either real time or GPU accelerated rendering circuits, or in a mixture of both. For most common interactive graphical applications, modern texture filtering is performed by dedicated hardware which optimizes memory access through memory cacheing and pre-fetch, and implements a selection of algorithms available to the user and developer. There are two main categories of texture filtering: magnification filtering and minification filtering. Depending on the situation, texture filtering is either a type of reconstruction filter where sparse data is interpolated to fill gaps (magnification), or a type of anti-aliasing (AA) where texture samples exist at a higher frequency than required for the sample frequency needed for texture fill (minification). There are many methods of texture filtering, which make different trade-offs between computational complexity, memory bandwidth and image quality. == The need for filtering == During the texture mapping process for any arbitrary 3D surface, a texture lookup takes place to find out where on the texture each pixel center falls. For texture-mapped polygonal surfaces composed of triangles typical of most surfaces in 3D games and movies, every pixel (or subordinate pixel sample) of that surface will be associated with some triangle(s) and a set of barycentric coordinates, which are used to provide a position within a texture. Such a position may not lie perfectly on the "pixel grid," necessitating some function to account for these cases. In other words, since the textured surface may be at an arbitrary distance and orientation relative to the viewer, one pixel does not usually correspond directly to one texel. Some form of filtering has to be applied to determine the best color for the pixel. Insufficient or incorrect filtering will show up in the image as artifacts (errors in the image), such as 'blockiness', jaggies, or shimmering. There can be different types of correspondence between a pixel and the texel/texels it represents on the screen. These depend on the position of the textured surface relative to the viewer, and different forms of filtering are needed in each case. Given a square texture mapped on to a square surface in the world, at some viewing distance the size of one screen pixel is exactly the same as one texel. Closer than that, the texels are larger than screen pixels, and need to be scaled up appropriately — a process known as texture magnification. Farther away, each texel is smaller than a pixel, and so one pixel covers multiple texels. In this case an appropriate color has to be picked based on the covered texels, via texture minification. Graphics APIs such as OpenGL allow the programmer to set different choices for minification and magnification filters. Note that even in the case where the pixels and texels are exactly the same size, one pixel will not necessarily match up exactly to one texel. It may be misaligned or rotated, and cover parts of up to four neighboring texels. Hence some form of filtering is still required. == Mipmapping == Mipmapping is a standard technique used to save some of the filtering work needed during texture minification. It is also highly beneficial for cache coherency - without it the memory access pattern during sampling from distant textures will exhibit extremely poor locality, adversely affecting performance even if no filtering is performed. During texture magnification, the number of texels that need to be looked up for any pixel is always four or fewer; during minification, however, as the textured polygon moves farther away potentially the entire texture might fall into a single pixel. This would necessitate reading all of its texels and combining their values to correctly determine the pixel color, a prohibitively expensive operation. Mipmapping avoids this by prefiltering the texture and storing it in smaller sizes down to a single pixel. As the textured surface moves farther away, the texture being applied switches to the prefiltered smaller size. Different sizes of the mipmap are referred to as 'levels', with Level 0 being the largest size (used closest to the viewer), and increasing levels used at increasing distances. == Filtering methods == This section lists the most common texture filtering methods, in increasing order of computational cost and image quality. === Nearest-neighbor interpolation === Nearest-neighbor interpolation is the simplest and crudest filtering method — it simply uses the color of the texel closest to the pixel center for the pixel color. While simple, this results in a large number of artifacts - texture 'blockiness' during magnification, and aliasing and shimmering during minification. This method is fast during magnification but during minification the stride through memory becomes arbitrarily large and it can often be less efficient than MIP-mapping due to the lack of spatially coherent texture access and cache-line reuse. === Nearest-neighbor with mipmapping === This method still uses nearest neighbor interpolation, but adds mipmapping — first the nearest mipmap level is chosen according to distance, then the nearest texel center is sampled to get the pixel color. This reduces the aliasing and shimmering significantly during minification but does not eliminate it entirely. In doing so it improves texture memory access and cache-line reuse through avoiding arbitrarily large access strides through texture memory during rasterization. This does not help with blockiness during magnification as each magnified texel will still appear as a large rectangle. === Linear mipmap filtering === Less commonly used, OpenGL and other APIs support nearest-neighbor sampling from individual mipmaps whilst linearly interpolating the two nearest mipmaps relevant to the sample. === Bilinear filtering === In Bilinear filtering, the four nearest texels to the pixel center are sampled (at the closest mipmap level), and their colors are combined by weighted average according to distance. This removes the 'blockiness' seen during magnification, as there is now a smooth gradient of color change from one texel to the next, instead of an abrupt jump as the pixel center crosses the texel boundary. Bilinear filtering for magnification filtering is common. When used for minification it is often used with mipmapping; though it can be used without, it would suffer the same aliasing and shimmering problems as nearest-neighbor filtering when minified too much. For modest minification ratios, however, it can be used as an inexpensive hardware accelerated weighted texture supersample. The Nintendo 64 used an unusual version of bilinear filtering where only three pixels are used known as 3-point texture filtering, instead of four due to hardware optimization concerns. This introduces a noticeable "triangulation bias" in some textures. === Trilinear filtering === Trilinear filtering is a remedy to a common artifact seen in mipmapped bilinearly filtered images: an abrupt and very noticeable change in quality at boundaries where the renderer switches from one mipmap level to the next. Trilinear filtering solves this by doing a texture lookup and bilinear filtering on the two closest mipmap levels (one higher and one lower quality), and then linearly interpolating the results. This results in a smooth degradation of texture quality as distance from the viewer increases, rather than a series of sudden drops. Of course, closer than Level 0 there is only one mipmap level available, and the algorithm reverts to bilinear filtering. === Anisotropic filtering === Anisotropic filtering is the highest quality filtering available in current consumer 3D graphics cards. Simpler, "isotropic" techniques use only square mipmaps which are then interpolated using bi– or trilinear filtering. (Isotropic means same in all directions, and hence is used to describe a system in which all the maps are squares rather than rectangles or other quadrilaterals.) When a surface is at a high angle relative to the camera, the fill area for a texture will not be approximately square. Consider the common case of a floor in a game: the fill area is far wider than it is tall. In this case, none of the square maps are a good fit. The result is blurriness and/or shimmering, depending on how the fit is chosen. Anisotropic filtering corrects this by sampling the texture as a non-square shape. The goal is

Social software engineering

Social software engineering (SSE) is a branch of software engineering that is concerned with the social aspects of software development and the developed software. SSE focuses on the socialness of both software engineering and developed software. On the one hand, the consideration of social factors in software engineering activities, processes and CASE tools is deemed to be useful to improve the quality of both development process and produced software. Examples include the role of situational awareness and multi-cultural factors in collaborative software development. On the other hand, the dynamicity of the social contexts in which software could operate (e.g., in a cloud environment) calls for engineering social adaptability as a runtime iterative activity. Examples include approaches which enable software to gather users' quality feedback and use it to adapt autonomously or semi-autonomously. SSE studies and builds socially-oriented tools to support collaboration and knowledge sharing in software engineering. SSE also investigates the adaptability of software to the dynamic social contexts in which it could operate and the involvement of clients and end-users in shaping software adaptation decisions at runtime. Social context includes norms, culture, roles and responsibilities, stakeholder's goals and interdependencies, end-users perception of the quality and appropriateness of each software behaviour, etc. The participants of the 1st International Workshop on Social Software Engineering and Applications (SoSEA 2008) proposed the following characterization: Community-centered: Software is produced and consumed by and/or for a community rather than focusing on individuals Collaboration/collectiveness: Exploiting the collaborative and collective capacity of human beings Companionship/relationship: Making explicit the various associations among people Human/social activities: Software is designed consciously to support human activities and to address social problems Social inclusion: Software should enable social inclusion enforcing links and trust in communities Thus, SSE can be defined as "the application of processes, methods, and tools to enable community-driven creation, management, deployment, and use of software in online environments". One of the main observations in the field of SSE is that the concepts, principles, and technologies made for social software applications are applicable to software development itself as software engineering is inherently a social activity. SSE is not limited to specific activities of software development. Accordingly, tools have been proposed supporting different parts of SSE, for instance, social system design or social requirements engineering. Consequently vertical market software, such as software development tools, engineering tools, marketing tools or software that helps users in a decision-making process can profit from social components. Such vertical social software differentiates strongly in its user-base from traditional social software such as Yammer.

Telebirr

Telebirr (Amharic: ቴሌብር) is a mobile payment service developed and was launched by Ethio telecom, the state owned telecommunication and Internet service provider in Ethiopia. It took five months to develop the end-to-end service. It facilitates the delivery of cashless transactions. The platform deployed currently has the capacity of processing up to 100 transactions per second (TPS) and can be scaled up to 1000 TPS. The service is accessible via SMS, USSD, and smartphone applications. Telebirr works in five languages. == Services == Though the service is fully accessible for any customer of Ethio telecom, the users need to register through the mobile application called Telebirr or using an authorized agent or Ethio telecom shop or Unstructured Supplementary Service Data (USSD), 127# nationally. However, Telebirr also provides a “quick registration” by using any information that already exists in Ethio telecom's system.