Random forest

Random forests or random decision forests is an ensemble learning method for classification, regression and other tasks that works by creating a multitude of decision trees during training. For classification tasks, the output of the random forest is the class selected by most trees. For regression tasks, the output is the average of the predictions of the trees. Random forests correct for decision trees' habit of overfitting to their training set. The first algorithm for random decision forests was created in 1995 by Tin Kam Ho using the random subspace method, which, in Ho's formulation, is a way to implement the "stochastic discrimination" approach to classification proposed by Eugene Kleinberg. An extension of the algorithm was developed by Leo Breiman and Adele Cutler, who registered "Random Forests" as a trademark in 2006 (as of 2019, owned by Minitab, Inc.). The extension combines Breiman's "bagging" idea and random selection of features, introduced first by Ho and later independently by Amit and Geman in order to construct a collection of decision trees with controlled variance. == History == The general method of random decision forests was first proposed by Salzberg and Heath in 1993, with a method that used a randomized decision tree algorithm to create multiple trees and then combine them using majority voting. This idea was developed further by Ho in 1995. Ho established that forests of trees splitting with oblique hyperplanes can gain accuracy as they grow without suffering from overtraining, as long as the forests are randomly restricted to be sensitive to only selected feature dimensions. A subsequent work along the same lines concluded that other splitting methods behave similarly, as long as they are randomly forced to be insensitive to some feature dimensions. 
This observation that a more complex classifier (a larger forest) gets more accurate nearly monotonically is in sharp contrast to the common belief that the complexity of a classifier can only grow to a certain level of accuracy before being hurt by overfitting. The explanation of the forest method's resistance to overtraining can be found in Kleinberg's theory of stochastic discrimination. The early development of Breiman's notion of random forests was influenced by the work of Amit and Geman who introduced the idea of searching over a random subset of the available decisions when splitting a node, in the context of growing a single tree. The idea of random subspace selection from Ho was also influential in the design of random forests. This method grows a forest of trees, and introduces variation among the trees by projecting the training data into a randomly chosen subspace before fitting each tree or each node. Finally, the idea of randomized node optimization, where the decision at each node is selected by a randomized procedure, rather than a deterministic optimization was first introduced by Thomas G. Dietterich. The proper introduction of random forests was made in a paper by Leo Breiman, that has become one of the world's most cited papers. This paper describes a method of building a forest of uncorrelated trees using a CART like procedure, combined with randomized node optimization and bagging. In addition, this paper combines several ingredients, some previously known and some novel, which form the basis of the modern practice of random forests, in particular: Using out-of-bag error as an estimate of the generalization error. Measuring variable importance through permutation. The report also offers the first theoretical result for random forests in the form of a bound on the generalization error which depends on the strength of the trees in the forest and their correlation. 
== Algorithm == === Preliminaries: decision tree learning === Decision trees are a popular method for various machine learning tasks. Tree learning is almost "an off-the-shelf procedure for data mining", say Hastie et al., "because it is invariant under scaling and various other transformations of feature values, is robust to inclusion of irrelevant features, and produces inspectable models. However, they are seldom accurate". In particular, trees that are grown very deep tend to learn highly irregular patterns: they overfit their training sets, i.e. have low bias, but very high variance. Random forests are a way of averaging multiple deep decision trees, trained on different parts of the same training set, with the goal of reducing the variance. This comes at the expense of a small increase in the bias and some loss of interpretability, but generally greatly boosts the performance in the final model. === Bagging === The training algorithm for random forests applies the general technique of bootstrap aggregating, or bagging, to tree learners. Given a training set X = x1, ..., xn with responses Y = y1, ..., yn, bagging repeatedly (B times) selects a random sample with replacement of the training set and fits a tree f_b (b = 1, ..., B) to each sample. After training, predictions for unseen samples x' can be made by averaging the predictions from all the individual regression trees on x': $\hat{f} = \frac{1}{B}\sum_{b=1}^{B} f_b(x')$, or by taking the plurality vote in the case of classification trees. This bootstrapping procedure leads to better model performance because it decreases the variance of the model, without increasing the bias. This means that while the predictions of a single tree are highly sensitive to noise in its training set, the average of many trees is not, as long as the trees are not correlated. 
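The bagging procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the base learner is a one-split regression "stump" on a single numeric feature, standing in for a full tree learner, and the names fit_stump, bagged_fit and bagged_predict are hypothetical.

```python
import random
from statistics import mean

def fit_stump(xs, ys):
    """Fit a one-split regression tree on 1-D inputs (stand-in for a full tree learner)."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue  # a split must leave points on both sides
        ml, mr = mean(left), mean(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    if best is None:  # all x values identical: fall back to the mean response
        m = mean(ys)
        return lambda x: m
    _, t, ml, mr = best
    return lambda x: ml if x <= t else mr

def bagged_fit(xs, ys, B, rng):
    """Fit B base learners, each on a bootstrap sample of size n drawn with replacement."""
    n = len(xs)
    models = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models

def bagged_predict(models, x):
    """Average the per-model predictions, i.e. (1/B) * sum_b f_b(x)."""
    return mean(f(x) for f in models)

# Toy data: a step function from 0.0 to 1.0 at x = 10.
xs = list(range(20))
ys = [0.0 if x < 10 else 1.0 for x in xs]
models = bagged_fit(xs, ys, B=25, rng=random.Random(0))
```

For classification, bagged_predict would take the most common predicted label instead of the mean.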
Simply training many trees on a single training set would give strongly correlated trees (or even the same tree many times, if the training algorithm is deterministic); bootstrap sampling is a way of de-correlating the trees by showing them different training sets. Additionally, an estimate of the uncertainty of the prediction can be made as the standard deviation of the predictions from all the individual regression trees on x': $\sigma = \sqrt{\frac{\sum_{b=1}^{B} (f_b(x') - \hat{f})^2}{B-1}}$. The number B of samples (equivalently, of trees) is a free parameter. Typically, a few hundred to several thousand trees are used, depending on the size and nature of the training set. B can be optimized using cross-validation, or by observing the out-of-bag error: the mean prediction error on each training sample xi, using only the trees that did not have xi in their bootstrap sample. The training and test error tend to level off after some number of trees have been fit. === From bagging to random forests === The above procedure describes the original bagging algorithm for trees. Random forests also include another type of bagging scheme: they use a modified tree learning algorithm that selects, at each candidate split in the learning process, a random subset of the features. This process is sometimes called "feature bagging". The reason for doing this is the correlation of the trees in an ordinary bootstrap sample: if one or a few features are very strong predictors for the response variable (target output), these features will be selected in many of the B trees, causing them to become correlated. An analysis of how bagging and random subspace projection contribute to accuracy gains under different conditions is given by Ho. Typically, for a classification problem with $p$ features, $\sqrt{p}$ (rounded down) features are used in each split. 
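The per-prediction standard deviation and the out-of-bag error described above can be computed from already-fitted trees. The sketch below makes several assumptions: each tree is a plain Python callable, the bootstrap indices of each tree are kept in a set, and squared error is used as the prediction error. The names prediction_std and oob_error are hypothetical.

```python
from statistics import mean, stdev

def prediction_std(models, x):
    """Sample standard deviation (denominator B - 1) of the per-tree predictions at x."""
    return stdev([f(x) for f in models])

def oob_error(models, bags, xs, ys):
    """Mean squared out-of-bag error.

    models[b] is the predictor fitted on the bootstrap sample whose training
    indices are bags[b]. Each x_i is predicted only by trees that never saw it;
    points that appear in every bag are skipped.
    """
    errors = []
    for i, (x, y) in enumerate(zip(xs, ys)):
        votes = [f(x) for f, bag in zip(models, bags) if i not in bag]
        if votes:
            errors.append((mean(votes) - y) ** 2)
    return mean(errors)

# Three hand-made constant "trees" and the indices each one was trained on.
models = [lambda x: 1.0, lambda x: 3.0, lambda x: 5.0]
bags = [{0}, {1}, {0, 1}]
xs, ys = [0, 1], [2.0, 5.0]
```

Tracking the out-of-bag error while increasing B is one way to see where the error levels off without holding out a separate validation set.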
For regression problems the inventors recommend $p/3$ (rounded down) with a minimum node size of 5 as the default. In practice, the best values for these parameters should be tuned on a case-by-case basis for every problem. === ExtraTrees === Adding one further step of randomization yields extremely randomized trees, or ExtraTrees. As with ordinary random forests, they are an ensemble of individual trees, but there are two main differences: (1) each tree is trained using the whole learning sample (rather than a bootstrap sample), and (2) the top-down splitting is randomized: for each feature under consideration, a number of random cut-points are selected, instead of computing the locally optimal cut-point (based on, e.g., information gain or the Gini impurity). The values are chosen from a uniform distribution within the feature's empirical range (in the tree's training set). Then, of all the randomly chosen splits, the split that yields the highest score is chosen to split the node. Similar to ordinary random forests, the number of randomly selected features to be considered at each node can be specified. Default values for this parameter are $\sqrt{p}$ for classification and $p$ for regression, where $p$ is the number of features in the model. === Random forests for high-dimensional data === The basic random forest procedure may

Telebirr

Telebirr (Amharic: ቴሌብር) is a mobile payment service developed and launched by Ethio telecom, the state-owned telecommunication and Internet service provider in Ethiopia. It took five months to develop the end-to-end service. It facilitates the delivery of cashless transactions. The platform currently deployed can process up to 100 transactions per second (TPS) and can be scaled up to 1000 TPS. The service is accessible via SMS, USSD, and smartphone applications. Telebirr works in five languages. == Services == Although the service is fully accessible to any Ethio telecom customer, users need to register through the Telebirr mobile application, an authorized agent, an Ethio telecom shop, or Unstructured Supplementary Service Data (USSD), 127# nationally. Telebirr also provides a “quick registration” that uses information already held in Ethio telecom's system.

Futel

Futel is a public arts organization in Portland, Oregon dedicated to preserving and maintaining public telephone hardware and offering free phone and basic information services. Futel was founded by Karl Anderson, a former software engineer, and Elijah St. Clair. == Technology == Karl Anderson stated that one motivation for the project was to explore the idea of urban furniture. Other reasons were to preserve an important part of hacker history, and to salvage and re-use manufactured items at the end of their lifecycle. The original Futel phones were set up in Portland, Oregon. The organization cleans and repurposes old public payphones which are often salvaged from Craigslist or scrappers. Using interface boxes, they are converted into VoIP phones which are made available publicly, with no cost for phone calls. Anderson has said the service runs on "Asterisk and OpenVPN and a lot of scripts." The payphones operate using publicly-available internet connections. The phones have automated phone trees and users can make a call to local social services, to a weather forecast line, or access local transit information. Volunteers act as telephone operators, offering information about the Futel service, or are available for conversation. Users using Futel's phones may also access voicemail boxes. The system has a "wildcard line" where people can listen to samples of audio left on the main voicemail line along with commentary from Anderson and others. == Network == In February 2021, there were 10 Futel phones in Portland and 3 in other cities. Phones were set up in Detroit and Ypsilanti, Michigan, and Long Beach, Washington. The organization has provided free phone service for a Portland-area homeless encampment after receiving funding from the Awesome Foundation. In 2019 the organization reported their phones being used to make 12,000 phone calls. 
Futel also said their usage went up rather than down during the first year of the COVID-19 pandemic, when they outfitted their phone kiosks with handwashing stations and used volunteers to keep the phones clean. The project is funded primarily through grants and is staffed by volunteers. The project has inspired others such as the PhilTel project in Philadelphia and the RandTel project in Randolph, Vermont. Futel publishes a zine called Party Line.

Account verification

Account verification is the process of verifying that a new or existing account is owned and operated by a specified real individual or organization. A number of websites, for example social media websites, offer account verification services. Verified accounts are often visually distinguished by check mark icons or badges next to the names of individuals or organizations. Account verification can enhance the quality of online services, mitigating sockpuppetry, bots, trolling, spam, vandalism, fake news, disinformation and election interference. == History == Account verification was introduced by Twitter in June 2009, initially as a feature for public figures and accounts of interest, individuals in "music, acting, fashion, government, politics, religion, journalism, media, sports, business and other key interest areas". A similar verification system was adopted by Google+ in 2011, Instagram in 2014, Pinterest in 2015, Facebook pages in October 2015 (available in the United States, Canada, the United Kingdom, Australia and New Zealand), and Facebook profiles and pages in 2018 (available worldwide). On YouTube, users are able to submit a request for a verification badge once they obtain 100,000 or more subscribers. YouTube also has an "official artist" badge for musicians and bands. In July 2016, Twitter announced that, beyond public figures, any individual would be able to apply for account verification. This was temporarily suspended in February 2018, following a backlash over the verification of one of the organisers of the far-right Unite the Right rally due to a perception that verification conveys "credibility" or "importance". In March 2018, during a live-stream on Periscope, Jack Dorsey, co-founder and CEO of Twitter, discussed the idea of allowing any individual to get a verified account. Twitter reopened account verification applications in May 2021 after revamping its account verification criteria. 
This time, Twitter offered notability criteria for the account categories of government; companies, brands, and organizations; news organizations and journalists; entertainment; sports; and activists, organizers, and other influential individuals. Instagram began allowing users to request verification in August 2018. In April 2018, Mark Zuckerberg, co-founder and CEO of Facebook, announced that purchasers of political or issue-based advertisements would be required to verify their identities and locations. He also indicated that Facebook would require individuals who manage large pages to be verified. In May 2018, Kent Walker, senior vice president of Google, announced that, in the United States, purchasers of political-leaning advertisements would need to verify their identities. In November 2022, Elon Musk made a blue verification check mark part of the paid Twitter Blue monthly membership. Prior to Musk's acquisition of Twitter, Twitter offered this check mark at no charge to confirmed high-profile users. On December 19, 2022, Twitter introduced two new check mark colors: gold for accounts from official businesses and organizations, and grey for accounts from governments or multilateral organizations. The type of check mark can be confirmed by visiting the profile page, then clicking or tapping on the check mark. == Techniques == === Identity verification services === Identity verification services are third-party solutions which can be used to ensure that a person provides information which is associated with the identity of a real person. Such services may verify the authenticity of identity documents such as driver's licenses or passports, called documentary verification, or may verify identity information against authoritative sources such as credit bureaus or government data, called nondocumentary verification. === Identity documents verification === The uploading of scanned or photographed identity documents is a practice in use, for example, at Facebook. 
According to Facebook, there are two reasons that a person would be asked to send a scan of or photograph of an ID to Facebook: to show account ownership and to confirm their name. In January 2018, Facebook purchased Confirm.io, a startup that was advancing technologies to verify the authenticity of identification documentation. === Biometric verification === === Behavioral verification === Behavioral verification is the computer-aided and automated detection and analysis of behaviors and patterns of behavior to verify accounts. Behaviors to detect include those of sockpuppets, bots, cyborgs, trolls, spammers, vandals, and sources and spreaders of fake news, disinformation and election interference. Behavioral verification processes can flag accounts as suspicious, exclude accounts from suspicion, or offer corroborating evidence for processes of account verification. === Bank account verification === Identity verification is required to establish bank accounts and other financial accounts in many jurisdictions. Verifying identity in the financial sector is often required by regulation such as Know Your Customer or Customer Identification Program. Accordingly, bank accounts can be of use as corroborating evidence when performing account verification. Bank account information can be provided when creating or verifying an account or when making a purchase. === Postal address verification === Postal address information can be provided when creating or verifying an account or when making and subsequently shipping a purchase. A hyperlink or code can be sent to a user by mail, recipients entering it on a website verifying their postal address. === Telephone number verification === A telephone number can be provided when creating or verifying an account or added to an account to obtain a set of features. 
During the process of verifying a telephone number, a confirmation code is sent to a phone number specified by a user, for example in an SMS message sent to a mobile phone. As the user receives the code sent, they can enter it on the website to confirm their receipt. === Email verification === An email account is often required to create an account. During this process, a confirmation hyperlink is sent in an email message to an email address specified by a person. The email recipient is instructed in the email message to navigate to the provided confirmation hyperlink if and only if they are the person creating an account. The act of navigating to the hyperlink confirms receipt of the email by the person. The added value of an email account for purposes of account verification depends upon the process of account verification performed by the specific email service provider. === Multi-factor verification === Multi-factor account verification is account verification which simultaneously utilizes a number of techniques. === Multi-party verification === The processes of account verification utilized by multiple service providers can corroborate one another. OpenID Connect includes a user information protocol which can be used to link multiple accounts, corroborating user information. == Account verification and good standing == On some services, account verification is synonymous with good standing. Twitter reserves the right to remove account verification from users' accounts at any time without notice. 
Reasons for removal may reflect behaviors on and off Twitter and include: promoting hate and/or violence against, or directly attacking or threatening other people on the basis of race, ethnicity, national origin, sexual orientation, gender, gender identity, religious affiliation, age, disability, or disease; supporting organizations or individuals that promote the above; inciting or engaging in the harassment of others; violence and dangerous behavior; directly or indirectly threatening or encouraging any form of physical violence against an individual or any group of people, including threatening or promoting terrorism; violent, gruesome, shocking, or disturbing imagery; self-harm, suicide; and engaging in other activity on Twitter that violates the Twitter Rules. In April 2023, Blue ticks were removed from all Twitter accounts that had not subscribed to Twitter Blue.

Global digital divide

The global digital divide describes global disparities, primarily between developed and developing countries, in regards to access to computing and information resources such as the Internet and the opportunities derived from such access. The Internet is expanding very quickly, and not all countries—especially developing countries—can keep up with the constant changes. The term "digital divide" does not necessarily mean that someone does not have technology; it could mean that there is simply a difference in technology. These differences can refer to, for example, high-quality computers, fast Internet, technical assistance, or telephone services. == Statistics == There is a large inequality worldwide in terms of the distribution of installed telecommunication bandwidth. In 2014 only three countries (China, US, Japan) host 50% of the globally installed bandwidth potential (see pie-chart Figure on the right). This concentration is not new, as historically only ten countries have hosted 70–75% of the global telecommunication capacity (see Figure). The U.S. lost its global leadership in terms of installed bandwidth in 2011, being replaced by China, which hosts more than twice as much national bandwidth potential in 2014 (29% versus 13% of the global total). == Versus the digital divide == The global digital divide is a special case of the digital divide; the focus is set on the fact that "Internet has developed unevenly throughout the world" causing some countries to fall behind in technology, education, labor, democracy, and tourism. The concept of the digital divide was originally popularized regarding the disparity in Internet access between rural and urban areas of the United States of America; the global digital divide mirrors this disparity on an international scale. The global digital divide also contributes to the inequality of access to goods and services available through technology. 
Computers and the Internet provide users with improved education, which can lead to higher wages; the people living in nations with limited access are therefore disadvantaged. This global divide is often characterized as falling along what is sometimes called the North–South divide of "northern" wealthier nations and "southern" poorer ones. == Obstacles to a solution == Some people argue that necessities need to be considered before achieving digital inclusion, such as an ample food supply and quality health care. Minimizing the global digital divide requires considering and addressing the following types of access: === Physical access === Physical access involves "the distribution of ICT devices per capita…and land lines per thousands". Individuals need to obtain access to computers, landlines, and networks in order to access the Internet. This access barrier is also addressed in Article 21 of the Convention on the Rights of Persons with Disabilities by the United Nations. === Financial access === The cost of ICT devices, traffic, applications, technician and educator training, software, maintenance, and infrastructure requires ongoing financial means. Financial access and "the levels of household income play a significant role in widening the gap". === Socio-demographic access === Empirical tests have identified that several socio-demographic characteristics foster or limit ICT access and usage. Across countries, educational levels and income are the most powerful explanatory variables, with age being a third one. While a global gender gap in access to and usage of ICTs exists, empirical evidence shows that this is due to unfavorable conditions concerning employment, education and income and not to technophobia or lower ability. In the contexts under study, women with the prerequisites for access and usage turned out to be more active users of digital tools than men. In the US, for example, the figures for 2018 show 89% of men and 88% of women use the Internet. 
=== Cognitive access === In order to use computer technology, a certain level of information literacy is needed. Further challenges include information overload and the ability to find and use reliable information. === Design access === Computers need to be accessible to individuals with different learning and physical abilities, which in the United States includes complying with Section 508 of the Rehabilitation Act as amended by the Workforce Investment Act of 1998. === Institutional access === In illustrating institutional access, Wilson states "the numbers of users are greatly affected by whether access is offered only through individual homes or whether it is offered through schools, community centers, religious institutions, cybercafés, or post offices, especially in poor countries where computer access at work or home is highly limited". === Political access === Guillen & Suarez argue that "democratic political regimes enable faster growth of the Internet than authoritarian or totalitarian regimes." The Internet is considered a form of e-democracy, and attempting to control what citizens can or cannot view is in contradiction to this. Recently, situations in Iran and China have denied people the ability to access certain websites and disseminate information. Iran has prohibited the use of high-speed Internet in the country and has removed many satellite dishes in order to prevent the influence of Western culture, such as music and television. === Cultural access === Many experts claim that bridging the digital divide is not sufficient and that content needs to be conveyed in language and images that can be read across different cultural lines. A 2013 study conducted by Pew Research Center noted that participants taking the survey in Spanish were nearly twice as likely not to use the internet. 
== Examples == In the early 21st century, residents of developed countries enjoy many Internet services which are not yet widely available in developing countries, including: Mobile phones and small electronic communication devices; E-communities and social-networking; Fast broadband Internet connections, enabling advanced Internet applications; Affordable and widespread Internet access, either through personal computers at home or work, through public terminals in public libraries and Internet cafes, and through wireless access points; E-commerce enabled by efficient electronic payment networks like credit cards and reliable shipping services; Virtual globes featuring street maps searchable down to individual street addresses and detailed satellite and aerial photography; Online research systems which enable users to peruse newspaper and magazine articles that may be centuries old, without having to leave home; Electronic readers such as Kindle, Sony Reader, Samsung Papyrus and Iliad by iRex Technologies; Price engines which help consumers find the best possible online prices and similar services which find the best possible prices at local retailers; Electronic services delivery of government services, such as the ability to pay taxes, fees, and fines online. Further civic engagement through e-government and other sources such as finding information about candidates regarding political situations. == Proposed remedies == There are four specific arguments why it is important to "bridge the gap": Economic equality – For example, the telephone is often seen as one of the most important components, because having access to a working telephone can lead to higher safety. If there were to be an emergency, one could easily call for help if one could use a nearby phone. In another example, many work-related tasks are online, and people without access to the Internet may not be able to complete work up to company standards. 
The Internet is regarded by some as a basic component of civic life that developed countries ought to guarantee for their citizens. Additionally, welfare services, for example, are sometimes offered via the Internet. Social mobility – Computer and Internet use is regarded as being very important to development and success. However, some children are not getting as much technical education as others, because lower socioeconomic areas cannot afford to provide schools with computer facilities. For this reason, some kids are being separated and not receiving the same chance as others to be successful. Democracy – Some people believe that eliminating the digital divide would help countries become healthier democracies. They argue that communities would become much more involved in events such as elections or decision making. Economic growth – It is believed that less-developed nations could gain quick access to economic growth if the information infrastructure were to be developed and well used. By improving the latest technologies, certain countries and industries can gain a competitive advantage. While these four arguments are meant to lead to a solution to the digital divide, there are a couple of other components that need to be considered. The first one is rural living versus s

Test data management

Test data management (TDM) is a process in software testing concerned with the creation, preparation, and control of data used for testing software systems. It involves supplying datasets required to execute test cases and verifying system behaviour under defined conditions. Test data management is an integral part of the software development lifecycle (SDLC) and is utilized in both manual and automated testing processes. It is applied in environments that use continuous integration and DevOps practices, where test execution requires consistent and repeatable data conditions. == Overview == Test data management includes the generation, selection, and preparation of data for testing purposes, as well as its distribution across test environments. It also involves controlling data versions and ensuring that datasets correspond to specific test scenarios. In many cases, production data is adapted for testing through techniques such as masking or subsetting to reduce size and remove sensitive content. Test data management ensures that test cases are executed with relevant, consistent, and readily available data. This reduces variability in test results and supports reproducibility across test cycles. == Importance == The role of test data management has expanded with the growth of complex, data-driven systems and regulatory requirements governing data usage. Testing often depends on data that reflects real-world conditions, but direct use of production data may introduce security and privacy risks. As a result, organizations apply methods such as data masking and anonymization to meet compliance requirements, including those set by the California Privacy Rights Act (CPRA) and Europe’s General Data Protection Regulation (GDPR). Inadequate control of test data can lead to incomplete test coverage, unreliable test results, or delays in testing processes due to unavailable or inconsistent datasets. 
== Techniques and tools == Test data management leverages various techniques for preparing and controlling data used in testing. These include the generation of synthetic data, the extraction of subsets from production datasets, and the modification of data to remove or obscure sensitive information. A key technical requirement in these processes is maintaining referential integrity, or ensuring that relationships between data entities remain consistent across different tables and systems after masking or subsetting. Data virtualization is also used to provide access to datasets without full replication. These methods may be implemented using software tools that automate data preparation, masking, and distribution.
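One way to keep referential integrity while masking is to replace each key with a deterministic token, so that the same original value maps to the same masked value in every table. The following Python sketch illustrates the idea under stated assumptions: the table layout, the column names and the helpers pseudonymize and mask_tables are hypothetical, and a keyed hash (HMAC-SHA256) is only one of several possible masking techniques.

```python
import hashlib
import hmac

def pseudonymize(value, secret):
    """Map a sensitive value to a stable token using a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "cust_" + digest[:12]

def mask_tables(customers, orders, secret):
    """Mask customer data while keeping orders joined to the right customer."""
    masked_customers = [
        {"customer_id": pseudonymize(c["customer_id"], secret), "name": "REDACTED"}
        for c in customers
    ]
    masked_orders = [
        # Same input and same secret give the same token, so the foreign key still matches.
        {**o, "customer_id": pseudonymize(o["customer_id"], secret)}
        for o in orders
    ]
    return masked_customers, masked_orders

# Hypothetical production-like rows.
customers = [
    {"customer_id": "C-1001", "name": "Alice Example"},
    {"customer_id": "C-1002", "name": "Bob Example"},
]
orders = [
    {"order_id": 1, "customer_id": "C-1001", "total": 25},
    {"order_id": 2, "customer_id": "C-1002", "total": 40},
    {"order_id": 3, "customer_id": "C-1001", "total": 10},
]
masked_customers, masked_orders = mask_tables(customers, orders, b"test-only-secret")
```

Because the mapping depends on the secret, the same secret has to be used for every table and system that shares the key, otherwise joins between the masked datasets break.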

Information element

An information element, sometimes informally referred to as a field, is an item in Q.931 and Q.2931 messages, IEEE 802.11 management frames, and cellular network messages sent between a base transceiver station and a mobile phone or similar piece of user equipment. An information element is often a type–length–value item, containing a type (which corresponds to the label of a field), a length indicator, and a value, although any combination of one or more of those parts is possible. A single message may contain multiple information elements. The abbreviation IE is found in many technical specification documents from 3GPP. It is not uncommon for a single specification document to contain thousands of references to IEs.
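As an illustration of the type–length–value layout, the Python sketch below splits a byte string into information elements. It assumes a one-byte type followed by a one-byte length and then that many bytes of value, which is the layout of IEEE 802.11 information elements; other protocols encode the type and length differently, and the function name parse_tlv is hypothetical.

```python
def parse_tlv(buf):
    """Split a byte string into (type, value) pairs.

    Assumes a 1-byte type, a 1-byte length, then `length` bytes of value.
    """
    elements = []
    pos = 0
    while pos < len(buf):
        if pos + 2 > len(buf):
            raise ValueError("truncated element header")
        ie_type, length = buf[pos], buf[pos + 1]
        end = pos + 2 + length
        if end > len(buf):
            raise ValueError("element value runs past end of buffer")
        elements.append((ie_type, buf[pos + 2:end]))
        pos = end
    return elements

# Two elements back to back: type 0 with a 4-byte value, type 1 with a 2-byte value.
frame = bytes([0, 4]) + b"home" + bytes([1, 2, 0x82, 0x84])
parsed = parse_tlv(frame)
# parsed == [(0, b"home"), (1, b"\x82\x84")]
```

Because each element carries its own length, a parser can skip element types it does not recognise and continue with the next one.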