Microelectronics and Computer Technology Corporation, originally the Microelectronics and Computer Consortium and widely seen by the acronym MCC, was the first, and at one time one of the largest, computer industry research and development consortia in the United States. MCC ceased operations in 2000 and was formally dissolved in 2004. == Divisions == MCC did research and development in the following areas: [1] System Architecture and Design (optimise hardware and software design, provide for scalability and interoperability, allow rapid prototyping for improved time-to-market, and support the re-engineering of existing systems for open systems). Advanced Microelectronics Packaging and Interconnection (smaller, faster, more powerful, and cost-competitive). Hardware Systems Engineering (tools and methodologies for cost-efficient, up-front design of advanced electronic systems, including modelling and design-for-test techniques to improve cost, yield, quality, and time-to-market). Environmentally Conscious Technologies (process control and optimisation tools, information management and analysis capabilities, and non-hazardous material alternatives supporting cost-efficient production, waste minimisation, and reduced environmental impact). Distributed Information Technology (managing and maintaining physically distributed corporate information resources on different platforms, building blocks for the national information infrastructure, networking tools and services for integration within and between companies, and electronic commerce). Intelligent Systems (systems that "intelligently" support business processes and enhance performance, including decision support, data management, forecasting and prediction). == History == The MCC was a response to the announcement of Japan's Fifth Generation Project, a large Japanese research project launched in 1982 aimed at producing a new kind of computer by 1991. The Japanese had formed similar industrial research consortia as early as 1956.[2] Many European and American computer companies saw this new Japanese initiative as an attempt to take full control of the world's high-end computer market, and MCC was created, in part, as a defensive move against that threat. In late 1982, several major computer and semiconductor manufacturers in the United States banded together and founded MCC under the leadership of Admiral Bobby Ray Inman, whose previous positions had been Director of the National Security Agency and deputy director of the Central Intelligence Agency. Such formations were illegal in the United States until the 1984 Congressional passage of the "National Cooperative Research Act". Several sites with relevant universities were considered, including Atlanta, Georgia (Georgia Tech), the Research Triangle, N.C. (UNC), the Washington, D.C. area (George Mason), Stanford University and Austin, Texas (UT) which was the final selection. The University of Texas offered land upon which they would construct a new building specifically designed for the MCC within their Austin campus. Ross Perot also offered the use of his private plane for 2 years for staff recruitment. Austin was selected as the site for MCC in 1983. Despite this purpose and the background of Inman and his senior staff, MCC accepted no government funding for many years and was a refuge for some avoiding work on Strategic Defense Initiative projects. MCC was part of the Artificial Intelligence boom of the 1980s, reportedly the single largest customer of both Symbolics and Lisp Machines, Inc. (and like Symbolics, was one of the first companies to register a .com domain). In the 1980s its major programs were packaging, software engineering, CAD, and advanced computer architectures. The latter comprised artificial intelligence, human interface, database, and parallel processing, the latter two merging in the late 1980s. Many of the early shareholder companies were mainframe computer companies under stress in the 1980s. Over the years, MCC's membership diversified to include a broad range of high-profile corporations involved in information technology products, as well as government research and development agencies and leading universities. In June, 2000 the MCC Board of Directors voted to dissolve the consortium, and the few remaining employees held a wake at Scholz's Beer Garden in Austin on October 25. Formal dissolution papers were reportedly not filed until 2004. == Spinoffs == While multiple technologies were transferred to member companies and government agencies in the final years, fourteen companies were spun out of MCC. Those spinoffs include: TeraVicta Technologies, Austin's first MEMS company; its focus was to develop microscopic switch technology for fiber optic switching and radiofrequency switching in mobile phones specifically to dynamically switch between the future 3G-4GLTE-future5G wireless communication frequencies and ensure mobile phones were communicating over the strongest wireless signal to reduce dropped calls. Robert Miracky was the founding CEO who spun out the first commercial metal micromachining technology developed by MCC researchers Brent Lunceford, Jason Reed, Richard Nelson, K.Hu, and C. Hilbert in a collaborative development program with IBM in a novel implementation and operational paradigm for solid-state integrated circuit coolers integrated with conductive MEMS switches. TeraVicta was liquidated under Chapter 7 bankruptcy proceedings in 2015. The Austin region subsequently built up a MEMS & Sensors value chain in the billions of dollars comprising companies such as 3M, Cypress Semiconductor, NXP Semiconductor, Cirrus Logic, Silicon Labs, and the Austin division of the now-defunct Silicon Valley Technology Center. Portelligent, a company that provides reverse engineering teardown services. At the time, Portelligent was the first company to commercialize such services; they had been provided by MCC to its member companies. Today, there are at least twelve companies worldwide that sell reports known as "reverse engineering teardown reports." Modern day teardown reports provide detailed information about technology products such as the bill of materials, microchip, and printed circuit board design specifics, manufacturing details including manufacturing location details for the entire value chain responsible for making electronics, including the iPhone and Samsung Galaxy smartphones. Portelligent was acquired by CMP Technology in 2007. Evolutionary Technologies International, a company focused on developing database tools and data warehousing. It was spun off from MCC in 1990.
Stochastic parrot
In machine learning, the term stochastic parrot is a metaphor that frames large language models as systems that statistically mimic text without real understanding. The word "stochastic" – from the ancient Greek "στοχαστικός" (stokhastikos, 'based on guesswork') – is a term from probability theory meaning "randomly determined". The word "parrot" refers to parrots' ability to mimic human speech. The term was introduced in a 2021 paper on AI ethics titled "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜" and authored by Timnit Gebru, Emily M. Bender, Angelina McMillan-Major, and Margaret Mitchell. The paper outlined possible risks associated with large language models (LLMs). In December 2020, it was the subject of a workplace dispute between Gebru (then co-leader of Google's Ethical Artificial Intelligence Team) and Google, which had requested the retraction of the paper. The incident culminated in Gebru's controversial departure from the company. The paper was later presented at the 2021 ACM Conference, and the term "stochastic parrot" has seen widespread use in academic research concerning generative AI and LLMs. The term has been interpreted negatively as an insult towards AI. == Background == Timnit Gebru is an AI ethics researcher, Emily M. Bender is a linguist specializing in computational linguistics, and Margaret Mitchell is a computer scientist specializing in algorithmic bias. Gebru had joined Google in 2018, where she co-led a team on the ethics of artificial intelligence with Mitchell. In late 2020, the paper "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜" was co-written by Gebru and five other researchers, four of whom were Google employees. The paper argues that large language models (LLMs) present significant risks such as environmental and financial costs, inscrutability leading to unknown dangerous biases, and potential for deception as LLMs do not understand the concepts underlying what they learn. The paper states that LLMs are "stitching together sequences of linguistic forms ... observed in its vast training data, according to probabilistic information about how they combine, but without any reference to meaning." Therefore, they are labeled "stochastic parrots". === Dismissal of Gebru by Google === After the paper was submitted for consideration to the 2021 ACM Conference, Google requested that Gebru either retract the paper from the conference or remove the names of Google employees from it. Gebru refused to do so without further discussion, and emailed Google Research vice president Megan Kacholia that if the company could not explain the request for retraction and address other concerns regarding similar projects, she would plan to resign after a transition period, stating that they could "work on a last date". The following day, on December 2, 2020, Gebru received an email saying that Google was "accepting her resignation". Her abrupt firing sparked protests by Google employees and negative publicity for the company. == Usage == The phrase has been used by AI skeptics to signify that LLMs lack understanding of the meaning of their outputs. Sam Altman, CEO of OpenAI, used the term shortly after the release of ChatGPT in December 2022, tweeting "i am a stochastic parrot, and so r u". The term was nominated as the 2023 AI-related Word of the Year by the American Dialect Society. == Debate == Some LLMs, such as ChatGPT, have become capable of interacting with users in convincingly human-like conversations. The development of these new systems has deepened the discussion of the extent to which LLMs understand or are simply "parroting". According to machine learning researchers Lindholm, Wahlström, Lindsten, and Schön, the term "stochastic parrot" highlights two vital limitations of LLMs: LLMs are limited by the data they are trained on and are simply stochastically repeating contents of datasets. Because they are just making up outputs based on training data, LLMs do not understand if they are saying something incorrect or inappropriate. Lindholm et al. noted that, with poor quality datasets and other limitations, a learning machine might produce results that are "dangerously wrong". === Subjective experience === In the mind of a human being, words and language correspond to things one has experienced. For LLMs, according to proponents of the theory, words correspond only to other words and patterns of usage fed into their training data. Proponents of the idea of stochastic parrots thus conclude that statements about LLMs are due to "the human tendency to attribute meaning to text", and claim this occurs despite the LLMs not actually understanding language. === Fine-tuning === Kelsey Piper argued that the claim that LLMs are stochastic parrots or mere "next-token predictors" focuses on pre-training, ignoring that modern LLMs are also fine-tuned to follow instructions and to prefer accurate answers. === Hallucinations and mistakes === The tendency of LLMs to pass off false information as fact is held as support. Called hallucinations or confabulations, LLMs will occasionally synthesize information that matches some pattern. LLMs may fail to distinguish fact and fiction, which leads to the claim that they can't connect words to a comprehension of the world, as humans do. Furthermore, LLMs may fail to decipher complex or ambiguous grammar cases that rely on understanding the meaning of language. For example: The wet newspaper that fell down off the table is my favorite newspaper. But now that my favorite newspaper fired the editor I might not like reading it anymore. Can I replace 'my favorite newspaper' by 'the wet newspaper that fell down off the table' in the second sentence? GPT-4, an LLM released in March 2023, responded yes, not understanding that the meaning of "newspaper" is different in these two contexts; it is first an object and second an institution. === Benchmarks and experiments === One argument against the hypothesis that LLMs are stochastic parrot is their results on benchmarks for reasoning, common sense and language understanding. In 2023, some LLMs have shown good results on many language understanding tests, such as the Super General Language Understanding Evaluation (SuperGLUE). GPT-4 scored in the >90th-percentile on the Uniform Bar Examination and achieved 93% accuracy on the MATH benchmark of high-school Olympiad problems, results that exceed rote pattern-matching expectations. Such tests, and the smoothness of many LLM responses, help as many as 51% of AI professionals believe they can truly understand language with enough data, according to a 2022 survey. === Expert rebuttals === Some AI researchers dispute the notion that LLMs merely "parrot" their training data. Geoffrey Hinton, a pioneering figure in neural networks, counters that the metaphor misunderstands the prerequisite for accurate language prediction. He argues that "to predict the next word accurately, you have to understand the sentence", a view he presented on 60 Minutes in 2023. From this perspective, understanding is not an alternative to statistical prediction, but rather an emergent property required to perform it effectively at scale. Hinton also uses logical puzzles to demonstrate that LLMs actually understand language. A 2024 Scientific American investigation described a closed Berkeley workshop where state-of-the-art models solved novel tier-4 mathematics problems and produced coherent proofs, indicating reasoning abilities beyond memorization. The GPT-4 Technical Report showed human-level results on professional and academic exams (e.g., the Uniform Bar Exam and USMLE), challenging the "parrot" characterization. Anthropic conducted mechanistic interpretability research on Claude, using attribution graphs to identify circuits. The research showed how the LLM processes information via chains of fuzzy logical inference, and indicated an ability to plan ahead. They found that Claude 3.5 Haiku "employs remarkably general abstractions", forms "internally generated plans for its future outputs" and "works backwards from its longer-term goals". They noted that "The mechanisms of the model can apparently only be faithfully described using an overwhelmingly large causal graph." They also found that the model includes "mechanisms that could underlie a simple form of metacognition", in that it "thinks about" the level of its own knowledge before reaching its answer. === Interpretability === Another line of evidence against the 'stochastic parrot' claim comes from mechanistic interpretability, a research field dedicated to reverse-engineering LLMs to understand their internal workings. Rather than only observing the model's input-output behavior, these techniques probe the model's internal activations, which can be used to determine if they contain structured representations of the world. The goal is to investigate whether LLMs are merely manipulating surface statistics or if t
Data proliferation
Data proliferation refers to the prodigious amount of data, structured and unstructured, that businesses and governments continue to generate at an unprecedented rate and the usability problems that result from attempting to store and manage that data. While originally pertaining to problems associated with paper documentation, data proliferation has become a major problem in primary and secondary data storage on computers. While digital storage has become cheaper, the associated costs, from raw power to maintenance and from metadata to search engines, have not kept up with the proliferation of data. Although the power required to maintain a unit of data has fallen, the cost of facilities which house the digital storage has tended to rise. Data proliferation has been documented as a problem for the U.S. military since August 1971, in particular regarding the excessive documentation submitted during the acquisition of major weapon systems. Efforts to mitigate data proliferation and the problems associated with it are ongoing. == Problems caused == The problem of data proliferation is affecting all areas of commerce as a result of the availability of relatively inexpensive data storage devices. This has made it very easy to dump data into secondary storage immediately after its window of usability has passed. This masks problem that could gravely affect the profitability of businesses and the efficient functioning of health services, police and security forces, local and national governments, and many other types of organizations. Data proliferation is problematic for several reasons: Difficulty when trying to find and retrieve information. At Xerox, on average it takes employees more than one hour per week to find hard-copy documents, costing $2,152 a year to manage and store them. For businesses with more than 10 employees, this increases to almost two hours per week at $5,760 per year. In large networks of primary and secondary data storage, problems finding electronic data are analogous to problems finding hard copy data. Data loss and legal liability when data is disorganized, not properly replicated, or cannot be found promptly. In April 2005, the Ameritrade Holding Corporation told 200,000 current and past customers that a tape containing confidential information had been lost or destroyed in transit. In May of the same year, Time Warner Incorporated reported that 40 tapes containing personal data on 600,000 current and former employees had been lost en route to a storage facility. In March 2005, a Florida judge hearing a $2.7 billion lawsuit against Morgan Stanley issued an "adverse inference order" against the company for "willful and gross abuse of its discovery obligations." The judge cited Morgan Stanley for repeatedly finding misplaced tapes of e-mail messages long after the company had claimed that it had turned over all such tapes to the court. Increased manpower requirements to manage increasingly chaotic data storage resources. Slower networks and application performance due to excess traffic as users search and search again for the material they need. High cost in terms of the energy resources required to operate storage hardware. A 100 terabyte system will cost up to $35,040 a year to run—not counting cooling costs. == Proposed solutions == Applications that better utilize modern technology Reductions in duplicate data (especially as caused by data movement) Improvement of metadata structures Improvement of file and storage transfer structures User education and discipline The implementation of Information Lifecycle Management solutions to eliminate low-value information as early as possible before putting the rest into actively managed long-term storage in which it can be quickly and cheaply accessed.
Data verification
Data verification is a process in which different types of data are checked for accuracy and inconsistencies after data migration is done. In some domains it is referred to Source Data Verification (SDV), such as in clinical trials. Data verification helps to determine whether data was accurately translated when data is transferred from one source to another, is complete, and supports processes in the new system. During verification, there may be a need for a parallel run of both systems to identify areas of disparity and forestall erroneous data loss. Methods for data verification include double data entry, proofreading and automated verification of data. Proofreading data involves someone checking the data entered against the original document. This is also time-consuming and costly. Automated verification of data can be achieved using one way hashes locally or through use of a SaaS based service such as Q by SoLVBL to provide immutable seals to allow verification of the original data.
Cryptochannel
In telecommunications, a cryptochannel is a complete system of crypto-communications between two or more holders or parties. It includes: (a) the cryptographic aids prescribed; (b) the holders thereof; (c) the indicators or other means of identification; (d) the area or areas in which effective; (e) the special purpose, if any, for which provided; and (f) pertinent notes as to distribution, usage, etc. A cryptochannel is analogous to a radio circuit.
Data event
A data event is a relevant state transition defined in an event schema. Typically, event schemata are described by pre- and post condition for a single or a set of data items. In contrast to ECA (Event condition action), which considers an event to be a signal, the data event not only refers to the change (signal), but describes specific state transitions, which are referred to in ECA as conditions. Considering data events as relevant data item state transitions allows defining complex event-reaction schemata for a database. Defining data event schemata for relational databases is limited to attribute and instance events. Object-oriented databases also support collection properties, which allows defining changes in collections as data events, too.
Pepper (cryptography)
In cryptography, a pepper is a secret added to an input such as a password during hashing with a cryptographic hash function. This value differs from a salt in that it is not stored alongside a password hash, but rather the pepper is kept separate using another meachanism, such as a Hardware Security Module. Note that the National Institute of Standards and Technology refers to this value as a secret key rather than a pepper. A pepper is similar in concept to a salt or an encryption key. It is like a salt in that it is a randomized value that is added to a password hash, and it is similar to an encryption key in that it should be kept secret. A pepper performs a comparable role to a salt or an encryption key, but while a salt is not secret (merely unique) and can be stored alongside the hashed output, a pepper is secret and must not be stored with the output. The hash and salt are usually stored in a database, but, if stored, a pepper must be stored separately to prevent it from being obtained by the attacker in case of a database breach. == History == The idea of a site- or service-specific salt (in addition to a per-user salt) has a long history, with Steven M. Bellovin proposing a local parameter in a Bugtraq post in 1995. In 1996 Udi Manber also described the advantages of such a scheme, terming it a secret salt. However, he suggested not storing the value of the secret salt, but instead rediscovering it by trial and error at password verification time. The term pepper has been used, by analogy to salt, but with a variety of meanings. For example, when discussing a challenge-response scheme, pepper has been used for a salt-like quantity, though not used for password storage; it has been used for a data transmission technique where a pepper must be guessed; and even as a part of jokes. The term pepper was proposed for a secret or local parameter stored separately from the password in a discussion of protecting passwords from rainbow table attacks. This usage did not immediately catch on: for example, Fred Wenzel added support to Django password hashing for storage based on a combination of bcrypt and HMAC with separately stored nonces, without using the term. Usage has since become more common. == Types == There are multiple different types of pepper: A shared secret that is common to all users. A randomly-selected number that must be re-discovered on every password input. These mechanisms could be combined with password salting, iterated hashing or even one another. == Shared-secret pepper == Bellovin and Webster suggest prepend a shared secret to the password before hashing, which allows easy use of existing hash functions. For example, consider two users to be added to a database. This table contains two combinations of username and password. The password is not saved, and the 8-byte (64-bit) 44534C70C6883DE2 pepper is saved in a safe place separate from the output values of the hash, in this case SHA256. Unlike the salt, the pepper does not provide protection to users who use the same password, but protects against dictionary attacks, unless the attacker has the pepper value available. Since the same pepper is not shared between different applications, an attacker is unable to reuse the hashes of one compromised database to another. A complete scheme for saving passwords may include both salt and pepper use. For example, it has been suggested to combine the pepper by encrypting salted password hashes, which allows rotation of the pepper. In the case of a shared-secret pepper, a single compromised password (via password reuse or other attack) along with a user's salt can lead to an attack to discover the pepper, rendering it ineffective. If an attacker knows a plaintext password and a user's salt, as well as the algorithm used to hash the password, then discovering the pepper can be a matter of brute forcing the values of the pepper. This is why NIST recommends the secret value be at least 112 bits, so that discovering it by exhaustive search is prohibitively expensive. The pepper must be generated anew for every application it is deployed in, otherwise a breach of one application would result in lowered security of another application. Without knowledge of the pepper, other passwords in the database will be far more difficult to extract from their hashed values, as the attacker would need to guess the password as well as the pepper. A pepper adds security to a database of salts and hashes because unless the attacker is able to obtain the pepper, cracking even a single hash is intractable, no matter how weak the original password. Even with a list of (salt, hash) pairs, an attacker must also guess the secret pepper in order to find the password which produces the hash. The NIST specification for a secret salt suggests using a Password-Based Key Derivation Function (PBKDF) with an approved Pseudorandom Function such as HMAC with SHA-3 as the hash function of the HMAC. The NIST recommendation is also to perform at least 1000 iterations of the PBKDF, and a further minimum 1000 iterations using the secret salt in place of the non-secret salt. == Randomly-selected pepper that must be re-discovered == The aim of this mechanism is to slow down password the password verification step, thus slowing attacks. The aim is similar increasing the iteration count on bcrypt or Argon2, but the mechanism is different. The secret salt or pepper must be rediscovered by the verifier or attacker each time by guessing. In this situation, the password hashing function is calculated using both the password and the pepper. At password storage time, the pepper is chosen randomly from a range between 1 and R, the hash output is calculated using the password and the pepper. The hash output is stored with the username. The pepper is then discarded. At password verification time, the verifier is provided with a username and password to verify. The originally calculated hash is retrieved for the given username, and then the hash of the password and each value between 1 and R is calculated. If any of these hash values match the stored password hash, the password is considered valid. Note, the possible values of the pepper should not be tested in a fixed order known to an attacker, otherwise a timing attack may reveal the pepper. If the password is correct, the correct pepper will be found in R/2 hash evaluations on average. If the password is incorrect, all R values must be tested before the password can be rejected.