Robust principal component analysis

Robust Principal Component Analysis (RPCA) is a modification of the widely used statistical procedure of principal component analysis (PCA) that works well with grossly corrupted observations. A number of different approaches exist for Robust PCA, including an idealized version that aims to recover a low-rank matrix $L_0$ from highly corrupted measurements $M = L_0 + S_0$. This decomposition into low-rank and sparse matrices can be achieved by techniques such as the Principal Component Pursuit method (PCP), Stable PCP, Quantized PCP, Block-based PCP, and Local PCP. The resulting problems are then solved with optimization methods such as the Augmented Lagrange Multiplier Method (ALM), Alternating Direction Method (ADM), Fast Alternating Minimization (FAM), Iteratively Reweighted Least Squares (IRLS) or alternating projections (AP).

== Algorithms ==

=== Non-convex method ===

The 2014 guaranteed algorithm for the robust PCA problem (with the input matrix being $M = L + S$) is an alternating-minimization-type algorithm. Its computational complexity is $O\left(mnr^{2}\log\frac{1}{\epsilon}\right)$, where the input is the superposition of a low-rank matrix (of rank $r$) and a sparse matrix, both of dimension $m \times n$, and $\epsilon$ is the desired accuracy of the recovered solution, i.e., $\|\widehat{L} - L\|_{F} \leq \epsilon$, where $L$ is the true low-rank component and $\widehat{L}$ is the estimated or recovered low-rank component.
Intuitively, this algorithm performs projections of the residual onto the set of low-rank matrices (via the SVD operation) and the set of sparse matrices (via entry-wise hard thresholding) in an alternating manner: a low-rank projection of the difference between the input matrix and the sparse matrix obtained at a given iteration, followed by a sparse projection of the difference between the input matrix and the low-rank matrix obtained in the previous step, with the two steps iterated until convergence.

This alternating projections algorithm was later improved by an accelerated version, named AccAltProj. The acceleration is achieved by applying a tangent space projection before projecting the residue onto the set of low-rank matrices. This trick improves the computational complexity to $O\left(mnr\log\frac{1}{\epsilon}\right)$ with a much smaller leading constant, while maintaining the theoretically guaranteed linear convergence.

Another fast version of the accelerated alternating projections algorithm is IRCUR. It uses the structure of the CUR decomposition within the alternating projections framework to dramatically reduce the computational complexity of RPCA to $O\left(\max\{m,n\}\,r^{2}\log(m)\log(n)\log\frac{1}{\epsilon}\right)$.

=== Convex relaxation ===

This method consists of relaxing the rank constraint $\operatorname{rank}(L)$ in the optimization problem to the nuclear norm $\|L\|_{*}$ and the sparsity constraint $\|S\|_{0}$ to the $\ell_{1}$-norm $\|S\|_{1}$. The resulting program can be solved using methods such as the method of Augmented Lagrange Multipliers.

=== Deep-learning augmented method ===

Some recent works propose RPCA algorithms with learnable/trainable parameters.
Such a learnable/trainable algorithm can be unfolded as a deep neural network whose parameters can be learned via machine learning techniques from a given dataset or problem distribution. The learned algorithm can then achieve superior performance on the corresponding problem distribution.

== Applications ==

RPCA has many important real-life applications, particularly when the data under study can naturally be modeled as a low-rank plus a sparse contribution. The following examples are inspired by contemporary challenges in computer science; depending on the application, either the low-rank component or the sparse component could be the object of interest.

=== Video surveillance ===

Given a sequence of surveillance video frames, it is often required to identify the activities that stand out from the background. If we stack the video frames as columns of a matrix $M$, then the low-rank component $L_0$ naturally corresponds to the stationary background and the sparse component $S_0$ captures the moving objects in the foreground.

=== Face recognition ===

Images of a convex, Lambertian surface under varying illuminations span a low-dimensional subspace. This is one of the reasons for the effectiveness of low-dimensional models for imagery data. In particular, it is easy to approximate images of a human's face by a low-dimensional subspace. Being able to correctly retrieve this subspace is crucial in many applications such as face recognition and alignment. It turns out that RPCA can be applied successfully to this problem to exactly recover the face.
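
The alternating projections idea described under the non-convex method can be illustrated with a minimal NumPy sketch. It is a simplification and not the published 2014 algorithm: it assumes the rank and the number of corrupted entries are known in advance, and the function names (`low_rank_projection`, `sparse_projection`, `alternating_projections`) and the toy data are hypothetical choices made for this illustration.

```python
import numpy as np

def low_rank_projection(A, rank):
    """Closest rank-`rank` matrix to A in Frobenius norm (truncated SVD)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]

def sparse_projection(A, num_nonzeros):
    """Keep the `num_nonzeros` largest-magnitude entries of A, zero the rest."""
    S = np.zeros_like(A)
    if num_nonzeros > 0:
        keep = np.argpartition(np.abs(A), -num_nonzeros, axis=None)[-num_nonzeros:]
        S.flat[keep] = A.flat[keep]
    return S

def alternating_projections(M, rank, num_nonzeros, iterations=100):
    """Alternate hard thresholding and truncated SVD to split M into L + S."""
    L = np.zeros_like(M)
    for _ in range(iterations):
        S = sparse_projection(M - L, num_nonzeros)  # sparse step on the residual
        L = low_rank_projection(M - S, rank)        # low-rank step on the residual
    return L, S

# Toy problem (assumed for illustration): a rank-1 matrix plus five large spikes.
u = np.linspace(1.0, 2.0, 20)
L_true = np.outer(u, u)
S_true = np.zeros_like(L_true)
spikes = [(0, 3), (5, 7), (9, 1), (14, 18), (19, 10)]
values = [50.0, -50.0, 50.0, -50.0, 50.0]
for (i, j), value in zip(spikes, values):
    S_true[i, j] = value
M = L_true + S_true

L_hat, S_hat = alternating_projections(M, rank=1, num_nonzeros=5)
relative_error = np.linalg.norm(L_hat - L_true) / np.linalg.norm(L_true)
print(f"relative error of recovered low-rank part: {relative_error:.2e}")
```

In the video surveillance setting, `M` would hold one frame per column, `L_hat` would approximate the background and `S_hat` the moving foreground. Practical algorithms choose thresholds adaptively instead of fixing the number of nonzero entries.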

Zero-knowledge service

In cloud computing, zero-knowledge (or occasionally no-knowledge or zero-access) is a commonly used term for online services that store, transfer or manipulate data with a high level of confidentiality, where the data is only accessible to the data's owner (the client), and not to the service provider. However, unlike "end-to-end encryption", the term "zero-knowledge" does not imply any specific threat model or security notion, and its use is commonly frowned upon by the security community. The term "zero-knowledge" was popularized by the backup service SpiderOak, which later switched to using the term "no knowledge", acknowledging that the previous terminology was not technically accurate.

== Disadvantages ==

Most cloud storage services keep a copy of the client's password on their servers, allowing clients who have lost their passwords to retrieve and decrypt their data using alternative means of authentication. Since zero-knowledge services do not store copies of clients' passwords, a client who loses their password can no longer decrypt their data, making it practically unrecoverable. Most of the widely used cloud storage services, such as Google Drive, Dropbox, OneDrive or iCloud, are also able to fulfil access requests from law enforcement agencies for similar reasons; zero-knowledge services, however, are unable to do so, since their systems are designed to make clients' data inaccessible without the client's explicit cooperation.
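
The link between a lost password and unrecoverable data can be illustrated with a small sketch. It assumes one common design, which the text above does not attribute to any particular provider: the encryption key is derived from the password on the client, and the service stores only a random salt and the ciphertext. The function name `derive_key`, the iteration count and the example passwords are hypothetical.

```python
import hashlib
import os

def derive_key(password: str, salt: bytes, iterations: int = 200_000) -> bytes:
    """Derive a 32-byte key from a password with PBKDF2-HMAC-SHA256."""
    return hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations, dklen=32
    )

# The salt is not secret; in this assumed design the service may store it.
salt = os.urandom(16)

key = derive_key("correct horse battery staple", salt)
same_key = derive_key("correct horse battery staple", salt)
wrong_key = derive_key("forgotten-password-guess", salt)

print(key == same_key)   # the right password reproduces the key
print(key == wrong_key)  # any other password yields an unrelated key
```

Because the key exists only when the client supplies the password, a provider holding just the salt and the ciphertext has nothing with which to decrypt the data, whether for account recovery or in response to an access request.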

Stochastic gradient descent

Stochastic gradient descent (often abbreviated SGD) is an iterative method for optimizing an objective function with suitable smoothness properties (e.g. differentiable or subdifferentiable). It can be regarded as a stochastic approximation of gradient descent optimization, since it replaces the actual gradient (calculated from the entire data set) by an estimate thereof (calculated from a randomly selected subset of the data). Especially in high-dimensional optimization problems this reduces the very high computational burden, achieving faster iterations in exchange for a lower convergence rate. The basic idea behind stochastic approximation can be traced back to the Robbins–Monro algorithm of the 1950s. Today, stochastic gradient descent has become an important optimization method in machine learning.

== Background ==

Both statistical estimation and machine learning consider the problem of minimizing an objective function that has the form of a sum:

$$Q(w) = \frac{1}{n}\sum_{i=1}^{n} Q_i(w),$$

where the parameter $w$ that minimizes $Q(w)$ is to be estimated. Each summand function $Q_i$ is typically associated with the $i$-th observation in the data set (used for training).

In classical statistics, sum-minimization problems arise in least squares and in maximum-likelihood estimation (for independent observations). Estimators in the general class that arise as minimizers of sums are called M-estimators. However, in statistics, it has long been recognized that requiring even local minimization is too restrictive for some problems of maximum-likelihood estimation. Therefore, contemporary statistical theorists often consider stationary points of the likelihood function (or zeros of its derivative, the score function, and other estimating equations).

The sum-minimization problem also arises for empirical risk minimization.
There, $Q_i(w)$ is the value of the loss function at the $i$-th example, and $Q(w)$ is the empirical risk.

When used to minimize the above function, a standard (or "batch") gradient descent method would perform the following iterations:

$$w := w - \eta\,\nabla Q(w) = w - \frac{\eta}{n}\sum_{i=1}^{n}\nabla Q_i(w).$$

The step size is denoted by $\eta$ (sometimes called the learning rate in machine learning), and here "$:=$" denotes the update of a variable in the algorithm.

In many cases, the summand functions have a simple form that enables inexpensive evaluations of the sum-function and the sum gradient. For example, in statistics, one-parameter exponential families allow economical function-evaluations and gradient-evaluations.

However, in other cases, evaluating the sum-gradient may require expensive evaluations of the gradients from all summand functions. When the training set is enormous and no simple formulas exist, evaluating the sums of gradients becomes very expensive, because evaluating the gradient requires evaluating all the summand functions' gradients. To economize on the computational cost at every iteration, stochastic gradient descent samples a subset of summand functions at every step. This is very effective in the case of large-scale machine learning problems.

== Iterative method ==

In stochastic (or "on-line") gradient descent, the true gradient of $Q(w)$ is approximated by a gradient at a single sample:

$$w := w - \eta\,\nabla Q_i(w).$$

As the algorithm sweeps through the training set, it performs the above update for each training sample. Several passes can be made over the training set until the algorithm converges. If this is done, the data can be shuffled for each pass to prevent cycles.
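
The single-sample update and the per-pass shuffling can be sketched in a few lines of Python. The sketch assumes, purely for illustration, that each $Q_i$ is the squared error of a straight-line fit $w_1 + w_2 x_i$ to the point $(x_i, y_i)$; the function name `sgd_line_fit`, the step size, the number of passes and the data are hypothetical choices, not values taken from the text.

```python
import random

def sgd_line_fit(xs, ys, eta=0.02, epochs=1000, seed=0):
    """Fit y ~ w1 + w2 * x by stochastic gradient descent on squared error."""
    rng = random.Random(seed)
    w1, w2 = 0.0, 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)                 # reshuffle before every pass
        for i in order:
            err = w1 + w2 * xs[i] - ys[i]  # residual of sample i only
            w1 -= eta * 2.0 * err          # dQ_i/dw1 = 2 * err
            w2 -= eta * 2.0 * err * xs[i]  # dQ_i/dw2 = 2 * err * x_i
    return w1, w2

# Noise-free toy data generated from y = 1 + 2x (assumed for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0 + 2.0 * x for x in xs]
w1, w2 = sgd_line_fit(xs, ys)
print(w1, w2)
```

Each update touches a single sample, so its cost does not depend on the size of the training set. A constant step size is enough here because the toy data can be fitted exactly, whereas noisy data generally calls for a decreasing or adaptive step size.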
Typical implementations may use an adaptive learning rate so that the algorithm converges.

In pseudocode, stochastic gradient descent can be presented as:

* Choose an initial vector of parameters $w$ and learning rate $\eta$.
* Repeat until an approximate minimum is obtained:
** Randomly shuffle the samples in the training set.
** For $i = 1, 2, \ldots, n$, do $w := w - \eta\,\nabla Q_i(w)$.

A compromise between computing the true gradient and the gradient at a single sample is to compute the gradient against more than one training sample (called a "mini-batch") at each step. This can perform significantly better than the "true" stochastic gradient descent described above, because the code can make use of vectorization libraries rather than computing each step separately, an approach first shown in work that called it "the bunch-mode back-propagation algorithm". It may also result in smoother convergence, as the gradient computed at each step is averaged over more training samples.

The convergence of stochastic gradient descent has been analyzed using the theories of convex minimization and of stochastic approximation. Briefly, when the learning rates $\eta$ decrease at an appropriate rate, and subject to relatively mild assumptions, stochastic gradient descent converges almost surely to a global minimum when the objective function is convex or pseudoconvex, and otherwise converges almost surely to a local minimum. This is in fact a consequence of the Robbins–Siegmund theorem.

== Linear regression ==

Suppose we want to fit a straight line $\hat{y} = w_1 + w_2 x$ to a training set with observations $((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n))$ and corresponding estimated responses $(\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_n)$ using least squares. The objective function to be minimized is

$$Q(w) = \sum_{i=1}^{n} Q_i(w) = \sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^{2} = \sum_{i=1}^{n}\left(w_1 + w_2 x_i - y_i\right)^{2}.$$

The last line in the above pseudocode for this specific problem will become:

$$\begin{bmatrix} w_1 \\ w_2 \end{bmatrix} \leftarrow \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} - \eta \begin{bmatrix} \frac{\partial}{\partial w_1}(w_1 + w_2 x_i - y_i)^{2} \\ \frac{\partial}{\partial w_2}(w_1 + w_2 x_i - y_i)^{2} \end{bmatrix} = \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} - \eta \begin{bmatrix} 2(w_1 + w_2 x_i - y_i) \\ 2x_i(w_1 + w_2 x_i - y_i) \end{bmatrix}.$$

Note that in each iteration or update step, the gradient is only evaluated at a single $x_i$. This is the key difference between stochastic gradient descent and batch gradient descent.

In general, given a linear regression $\hat{y} = \sum_{k \in 1:m} w_k x_k$ problem, stochastic gradient descent behaves differently when $m < n$

Random indexing

Random indexing is a dimensionality reduction method and computational framework for distributional semantics, based on the insight that very-high-dimensional vector space model implementations are impractical, that models need not grow in dimensionality when new items (e.g. new terminology) are encountered, and that a high-dimensional model can be projected into a space of lower dimensionality without compromising L2 distance metrics if the resulting dimensions are chosen appropriately. This is the original point of the random projection approach to dimension reduction first formulated as the Johnson–Lindenstrauss lemma, and locality-sensitive hashing has some of the same starting points. Random indexing, as used in representation of language, originates from the work of Pentti Kanerva on sparse distributed memory, and can be described as an incremental formulation of a random projection. It can be also verified that random indexing is a random projection technique for the construction of Euclidean spaces—i.e. L2 normed vector spaces. In Euclidean spaces, random projections are elucidated using the Johnson–Lindenstrauss lemma. The TopSig technique extends the random indexing model to produce bit vectors for comparison with the Hamming distance similarity function. It is used for improving the performance of information retrieval and document clustering. In a similar line of research, Random Manhattan Integer Indexing (RMII) is proposed for improving the performance of the methods that employ the Manhattan distance between text units. Many random indexing methods primarily generate similarity from co-occurrence of items in a corpus. Reflexive Random Indexing (RRI) generates similarity from co-occurrence and from shared occurrence with other items.
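
The incremental formulation can be made concrete with a short sketch. It assumes one simple variant: each document is treated as a context and receives a sparse ternary index vector, and a word's vector is the running sum of the index vectors of the documents it occurs in. The function names (`make_index_vector`, `random_indexing`, `cosine`), the dimensionality and the toy corpus are hypothetical choices for illustration only.

```python
import math
import random

def make_index_vector(dim, nonzeros, rng):
    """Sparse ternary index vector stored as {position: +1 or -1}."""
    positions = rng.sample(range(dim), nonzeros)
    return {p: rng.choice((-1, 1)) for p in positions}

def random_indexing(documents, dim=512, nonzeros=8, seed=0):
    """Accumulate one fixed-size context vector per word, one document at a time."""
    rng = random.Random(seed)
    context = {}
    for doc in documents:
        doc_index = make_index_vector(dim, nonzeros, rng)
        for word in doc.split():
            vec = context.setdefault(word, [0] * dim)
            for pos, sign in doc_index.items():
                vec[pos] += sign           # add the document's index vector
    return context

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

docs = ["cat sat mat", "cat sat rug", "stock market fell"]
vectors = random_indexing(docs)
print(cosine(vectors["cat"], vectors["sat"]))    # words sharing all contexts
print(cosine(vectors["cat"], vectors["stock"]))  # words sharing no context
```

The dimensionality stays fixed at `dim` however many documents or new words arrive, which is the property the text describes as not growing when new items are encountered.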

Optical character recognition

Optical character recognition (OCR) or optical character reader is the electronic or mechanical conversion of images of typed, handwritten or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene photo (for example the text on signs and billboards in a landscape photo) or from subtitle text superimposed on an image (for example: from a television broadcast). Widely used as a form of data entry from printed paper data records – whether passport documents, invoices, bank statements, computerized receipts, business cards, mail, printed data, or any suitable documentation – it is a common method of digitizing printed texts so that they can be electronically edited, searched, stored more compactly, displayed online, and used in machine processes such as cognitive computing, machine translation, (extracted) text-to-speech, key data and text mining. OCR is a field of research in pattern recognition, artificial intelligence and computer vision. Early versions needed to be trained with images of each character, and worked on one font at a time. Advanced systems capable of producing a high degree of accuracy for most fonts are now common, and with support for a variety of image file format inputs. Some systems are capable of reproducing formatted output that closely approximates the original page including images, columns, and other non-textual components. == History == Early optical character recognition may be traced to technologies involving telegraphy and creating reading devices for the blind. In 1914, Emanuel Goldberg developed a machine that read characters and converted them into standard telegraph code. Concurrently, Edmund Fournier d'Albe developed the Optophone, a handheld scanner that when moved across a printed page, produced tones that corresponded to specific letters or characters. 
In the late 1920s and into the 1930s, Emanuel Goldberg developed what he called a "Statistical Machine" for searching microfilm archives using an optical code recognition system. In 1931, he was granted US Patent number 1,838,389 for the invention. The patent was acquired by IBM. === Visually impaired users === In 1974, Ray Kurzweil started the company Kurzweil Computer Products, Inc. and continued development of omni-font OCR, which could recognize text printed in virtually any font. (Kurzweil is often credited with inventing omni-font OCR, but it was in use by companies, including CompuScan, in the late 1960s and 1970s.) Kurzweil used the technology to create a reading machine for blind people to have a computer read text to them out loud. The device included a CCD-type flatbed scanner and a text-to-speech synthesizer. On January 13, 1976, the finished product was unveiled during a widely reported news conference headed by Kurzweil and the leaders of the National Federation of the Blind. In 1978, Kurzweil Computer Products began selling a commercial version of the optical character recognition computer program. LexisNexis was one of the first customers, and bought the program to upload legal paper and news documents onto its nascent online databases. Two years later, Kurzweil sold his company to Xerox, which eventually spun it off as Scansoft, which merged with Nuance Communications. In the 2000s, OCR was made available online as a service (WebOCR), in a cloud computing environment, and in mobile applications like real-time translation of foreign-language signs on a smartphone. With the advent of smartphones and smartglasses, OCR can be used in internet connected mobile device applications that extract text captured using the device's camera. These devices that do not have built-in OCR functionality will typically use an OCR API to extract the text from the image file captured by the device. 
The OCR API returns the extracted text, along with information about the location of the detected text in the original image, back to the device app for further processing (such as text-to-speech) or display. Various commercial and open source OCR systems are available for most common writing systems, including Latin, Cyrillic, Arabic, Hebrew, Indic, Bengali (Bangla), Devanagari, Tamil, Chinese, Japanese, and Korean characters.

== Applications ==

OCR engines have been developed into software applications specializing in various subjects such as receipts, invoices, checks, and legal billing documents. The software can be used for:

* Entering data for business documents, e.g. checks, passports, invoices, bank statements and receipts
* Automatic number-plate recognition
* Passport recognition and information extraction in airports
* Automatically extracting key information from insurance documents
* Traffic-sign recognition
* Extracting business card information into a contact list
* Creating textual versions of printed documents, e.g. book scanning for Project Gutenberg
* Making electronic images of printed documents searchable, e.g. Google Books
* Converting handwriting in real-time to control a computer (pen computing)
* Defeating or testing the robustness of CAPTCHA anti-bot systems, though these are specifically designed to prevent OCR
* Assistive technology for blind and visually impaired users
* Writing instructions for vehicles by identifying CAD images in a database that are appropriate to the vehicle design as it changes in real time
* Making scanned documents searchable by converting them to PDFs

== Types ==

* Optical character recognition (OCR) – targets typewritten text, one glyph or character at a time.
* Optical word recognition – targets typewritten text, one word at a time (for languages that use a space as a word divider). Usually just called "OCR".
* Intelligent character recognition (ICR) – also targets handwritten printscript or cursive text one glyph or character at a time, usually involving machine learning.
* Intelligent word recognition (IWR) – also targets handwritten printscript or cursive text, one word at a time. This is especially useful for languages where glyphs are not separated in cursive script.

OCR is generally an offline process, which analyses a static document. There are cloud-based services which provide an online OCR API service. Handwriting movement analysis can be used as input to handwriting recognition. Instead of merely using the shapes of glyphs and words, this technique is able to capture motion, such as the order in which segments are drawn, the direction, and the pattern of putting the pen down and lifting it. This additional information can make the process more accurate. This technology is also known as "online character recognition", "dynamic character recognition", "real-time character recognition", and "intelligent character recognition".

== Techniques ==

=== Pre-processing ===

OCR software often pre-processes images to improve the chances of successful recognition. Techniques include:

* De-skewing – if the document was not aligned properly when scanned, it may need to be tilted a few degrees clockwise or counterclockwise in order to make lines of text perfectly horizontal or vertical.
* Despeckling – removal of positive and negative spots, smoothing edges.
* Binarization – conversion of an image from color or greyscale to black-and-white (called a binary image because there are two colors). The task is performed as a simple way of separating the text (or any other desired image component) from the background. The task of binarization is necessary since most commercial recognition algorithms work only on binary images, as it is simpler to do so. In addition, the effectiveness of binarization influences to a significant extent the quality of character recognition, and careful decisions are made in the choice of the binarization employed for a given input image type, since the quality of the method used to obtain the binary result depends on the type of image (scanned document, scene text image, degraded historical document, etc.).
* Line removal – cleaning up non-glyph boxes and lines.
* Layout analysis or zoning – identification of columns, paragraphs, captions, etc. as distinct blocks. Especially important in multi-column layouts and tables.
* Line and word detection – establishment of a baseline for word and character shapes, separating words as necessary.
* Script recognition – in multilingual documents, the script may change at the level of the words and hence, identification of the script is necessary before the right OCR can be invoked to handle the specific script.
* Character isolation or segmentation – for per-character OCR, multiple characters that are connected due to image artifacts must be separated; single characters that are broken into multiple pieces due to artifacts must be connected.
* Normalization of aspect ratio and scale

Segmentation of fixed-pitch fonts is accomplished relatively simply by aligning the image to a uniform grid based on where vertical grid lines will least often intersect black areas. For proportional fonts, more sophisticated techniques are needed because whitespace bet

Record sealing

Record sealing is the process of making public records inaccessible to the public. In many cases, a person with a sealed record gains the legal right to deny or not acknowledge anything to do with the arrest and the legal proceedings from the case itself. Records are commonly sealed in a number of situations:

* Sealed birth records (typically after adoption or determination of paternity)
* Juvenile criminal records may be sealed
* Other types of cases involving juveniles may be sealed, anonymized, or pseudonymized ("impounded"); e.g., child sex offense or custody cases
* Cases using witness protection information may be partly sealed
* Cases involving trade secrets
* Cases involving state secrets

== Filing under seal in US court ==

Normally, records should not be filed under seal without the court's permission. However, FRCP 5.2 requires that sensitive text – like Social Security numbers, Taxpayer Identification Numbers, birthdays, bank accounts, and children’s names – be redacted from the filings made with the court and from accompanying exhibits. A person making a redacted filing can file an unredacted copy under seal, or the court can choose to order later that an additional filing be made under seal without redaction. Alternatively, the filing party may ask the court’s permission to file some exhibits completely under seal.

When a document is filed "under seal", it should have a clear indication for the court clerk to file it separately, most often by stamping the words "Filed Under Seal" on the bottom of each page. The person making the filing should also provide instructions to the court clerk that the document needs to be filed "under seal". Courts often have specific requirements for these filings in their Local Rules.

== Difference from expungement ==

Expungement is the physical destruction, namely the complete erasure, of one's criminal records, and therefore usually carries a higher standard. Record sealing, by contrast, only restricts the public's access to records, so that only certain law enforcement agencies or courts, under special circumstances, will have access to them. A record seal can improve the chance of employment, as employers will not have access to damning records. There are occasions, as with expungement, where one can truthfully state under oath that they have never been convicted before. Most of the time, a record seal has more relaxed requirements than an expungement. If an expungement is not allowed in a case, then sealing the record may be the best option. Different states have different terms for what constitutes sealing of a record.

== Cybersecurity incidents involving sealed records ==

Several cybersecurity incidents have demonstrated that sealed court documents are not always secure in practice, with vulnerabilities and data breaches exposing sensitive information.

In January 2021, following the SolarWinds cyber attack, the U.S. Bankruptcy Court United States District Court for the District of Nevada announced that its Case Management/Electronic Case Files (CM/ECF) system had been potentially compromised. The judiciary stated that additional safeguards were being implemented to protect filings, and that the review of the incident and its impact was ongoing. Reports noted that the breach raised concerns about exposure of highly sensitive and sealed documents submitted through the CM/ECF system.

In 2023, security researcher Jason Parker, following a tip from an activist, identified flaws in online court systems that exposed sealed records, including confidential testimony and medical records, through publicly accessible portals.

In 2024, a cyber intrusion targeting attorneys in a civil case involving Representative Matt Gaetz led to the unauthorized access and leak of sealed depositions and related records. The breach exposed confidential testimony and financial records, some of which were later reported by news outlets, raising concerns about the security of electronically stored legal materials and the handling of sealed filings.

In 2025, multiple reports confirmed that the federal judiciary's CM/ECF and PACER filing system was compromised, exposing sealed indictments, confidential informant information, and other sensitive filings. Some courts temporarily reverted to paper-based filing to mitigate the risks of further disclosure. The FBI later confirmed that the breach had exposed sealed records, and investigators suspected foreign state actors were involved.

== GAO publications referencing sealed records ==

* Closed Criminal Plea and Sentencing Proceedings (1983) – Reviewed Department of Justice policies on closing plea and sentencing hearings. GAO noted that sealed transcripts should be unsealed once the reasons for closure no longer applied.
* Information on Plea Agreements and Settlements in Defense Procurement Fraud Cases (1992) – Examined outcomes of procurement fraud prosecutions. GAO observed that in some instances the results were sealed from public access.
* Military Recruiting: More Needs to Be Done to Better Screen Applicants and Detect Fraud (1999) – Investigated fraudulent enlistments in the armed forces. The report highlighted that sealed juvenile records often prevented recruiters from discovering prior offenses.
* Social Security Numbers: Governments Could Do More to Reduce Display in Public Records (2004) – Analyzed risks associated with SSN availability in state and local records. GAO pointed out that some categories of records, such as adoption proceedings, were sealed and less likely to expose identifiers.
* Social Security Numbers: Stronger Safeguards Needed to Protect Privacy (2005 testimony) – Testimony before Congress reiterating concerns over SSN exposure in public records, while noting that sealed categories (e.g., adoption) were exceptions.
* U.S. Supreme Court: Policies and Perspectives on Video and Audio Coverage of Appellate Court Proceedings (2016) – Surveyed appellate court policies on courtroom media coverage. The report acknowledged distinctions between public filings, confidential submissions, and sealed materials.
* Evictions: National Data Are Limited and Challenging to Collect (2024) – Examined nationwide eviction data. GAO reported that in some states eviction records may be sealed or expunged, limiting researchers' ability to compile datasets.
* DOD Fraud Risk Management: Enhanced Data and Collaboration Could Improve Efforts (2024) – Reviewed Department of Defense fraud-risk management. GAO noted that some adjudicative records in its dataset were sealed, restricting completeness of oversight data.

VITAL (machine learning software)

VITAL (Validating Investment Tool for Advancing Life Sciences) was a Board Management Software machine learning proprietary software developed by Aging Analytics, a company registered in Bristol (England) and dissolved in 2017. Andrew Garazha (the firm's Senior Analyst) declared that the project aimed "through iterative releases and updates to create a piece of software capable of making autonomous investment decisions." According to Nick Dyer-Witheford, VITAL 1.0 was a "basic algorithm". On 13 May 2014, Deep Knowledge Ventures, a Hong Kong venture capital firm, claimed to have appointed VITAL to its board of directors in order to prove that artificial intelligence could be an instrument for investment decision-making. The announcement received great press coverage despite the fact commentators consider this a publicity stunt. Fortune reported in 2019 that VITAL is no longer used. == Criticism == Academics and journalists viewed VITAL's board appointment with skepticism. University of Sheffield computer science professor Noel Sharkey called it "a publicity hype". Michael Osborne, a University of Oxford associate professor in machine learning, found it is "a gimmick to call that an actual board member". Simon Sharwood of The Register, wrote there is "a strong whiff of stunt and/or promotion about this". In a 2019 speech, the Chief Scientist of Australia, Alan Finkel, commented, "At the time, most of us probably dismissed Vital as a PR exercise. I admit, I used her story three years ago to get a laugh in one of my speeches." Florian Möslein, a law professor at the University of Marburg, wrote in 2018 that "Vital has widely been acknowledged as the 'world's first artificial intelligence company director'". Vice journalist Jason Koebler suggested that the software did not have any article intelligence capabilities and concluded "VITAL can’t talk, and it can’t hear, and it can’t be a real, functional executive of a company." 
Sharwood of The Register noted that because VITAL was not a natural person, it could not be a board member under Hong Kong's corporate governance laws. However, in a 2017 interview with The Nikkei, Dmitry Kaminskiy, managing partner of Deep Knowledge Ventures, stated that VITAL had observer status on the board and no voting rights. University of Sheffield computer science professor Noel Sharkey said of VITAL, "On first sight, it looks like a futuristic idea but on reflection it is really a little bit of publicity hype." Vice journalist Jason Koebler said "this is a gimmick" and added, "There is literally nothing to suggest that VITAL has any sort of capabilities beyond any other proprietary analysis software". Michael Osborne, a University of Oxford associate professor in machine learning, found VITAL's appointment not credible, saying it is "a bit of a gimmick to call that an actual board member". Osborne said that a core duty of board members is to converse with each other, which the algorithm is incapable of doing, so its more likely function is to serve as a springboard for conversation among the other board members. == Machine intelligence as board member == VITAL was created by a group of programmers employed by Aging Analytics. According to Andrew Garazha, Aging Analytics' Senior Analyst, VITAL was not a machine learning algorithm, as the necessary datasets on investment rounds, intellectual property and clinical trial outcomes are generally not disclosed. Rather, VITAL used fuzzy logic based on 50 parameters to assess risk factors. Aging Analytics licensed the software to Deep Knowledge Ventures. It was used to help the human board members of Deep Knowledge Ventures make investment decisions in biotechnology companies. 
For instance, it supported investments in Insilico Medicine, which develops ways for computers to help discover drugs in aging research. VITAL also supported investing in Pathway Pharmaceuticals, which uses the OncoFinder algorithm to choose and appraise cancer treatments. According to Dmitry Kaminskiy, managing partner of Deep Knowledge Ventures, the motivation for using VITAL was the large number of failed investments in the biotechnology sector and the desire to avoid investing in companies likely to fail. == Ethical and legal implications == Scholars addressed questions around the safety, privacy, accountability, transparency and bias of algorithms. Writing in the philosophical journal Multitudes, the academic Ariel Kyrou raised questions about the consequences of a mistake made by an algorithm recommending a dangerous investment. He posed a hypothetical in which VITAL persuades the board to invest in a startup that has the facade of researching treatments for age-associated ills but is in fact run by terrorists raising funds. Kyrou raised a series of questions about whom society would fault for VITAL's mistake. As the owner of VITAL, should Deep Knowledge Ventures be held accountable, or should the companies that supplied data to VITAL or the people who created VITAL be held liable? Simon Sharwood of The Register wrote that because the appointment of a software program to the board of directors is not legally feasible in Hong Kong, there is "a strong whiff of stunt and/or promotion about this". Quoting a Thomson Reuters website describing Hong Kong legislation related to corporate governance, Sharwood pointed out that in Hong Kong "the board comprises all of the directors of the company" and "a director must normally be a natural person, except that a private company may have a body corporate as its director if the company is not a member of a listed group." 
He concluded that since VITAL cannot be considered a "natural person", it is merely a "cosmetic" appointment to the board and that "this software is no more a Board member than Caligula's horse was a senator". Sharwood further argued that corporations frequently purchase directors and officers liability insurance but that it would be practically impossible to get such insurance for VITAL. Sharwood also wrote that were VITAL to be hacked, any misinformation it output could be considered "false and misleading communications". In the book Research Handbook on the Law of Artificial Intelligence, Florian Möslein wrote that VITAL could not become a director as defined in Hong Kong's corporate laws, so the other directors were simply treating it as "a member of [the] board with observer status". Lin Shaowei wrote in a Journal of East China University of Political Science and Law article that the software's appearance raised a complex question about the relationship between corporate law and artificial intelligence. VITAL could be considered either a board director who has voting rights or an observer who does not. Lin said either choice raised questions about whether VITAL is subject to corporate law and who would be held accountable if VITAL recommends a choice that turns out to be damaging to the company. David Theo Goldberg, writing in Critical Times, a peer-reviewed journal in Critical Global Theory, argues that VITAL processed a dataset to predict the most remunerative investment opportunities. Drawing on an article from Business Insider, Goldberg describes VITAL's predictive decision-making as based "on surface pattern recognition and the identification of regularities and/or irregularities". In other words, Goldberg asserts that "the normativity of the surface" explains the algorithmic knowledge of a "product" like VITAL. In Homo Deus, Yuval Noah Harari mentions VITAL as an example of the future risks that humankind faces. 
Harari argues that human decision-making is giving way to a world in which algorithms and data make the decisions. Specifically, he argues that "as algorithms push humans out of the job market," executive boards driven by artificial intelligence are more likely to give priority to algorithms over humans.