Jian Ma (computational biologist)

Jian Ma (Chinese: 马坚) is an American computer scientist and computational biologist. He is the Ray and Stephanie Lane Professor of Computational Biology in the School of Computer Science at Carnegie Mellon University. He is a faculty member in the Ray and Stephanie Lane Computational Biology Department. His lab develops AI/ML methods to study the structure and function of the human genome and cellular organization and their implications for health and disease. During his Ph.D. and postdoc training, he developed algorithms to reconstruct the ancestral mammalian genome and evolutionary history. His research group has recently pioneered a series of new machine learning solutions for 3D genome organization, single-cell epigenomics, spatial omics, and complex molecular interactions. His lab also explores large language models to uncover gene regulatory mechanisms and the intricate connections among cellular components, with the aim of driving discovery and guiding experimentation. He received an NSF CAREER award in 2011. In 2020, he was awarded a Guggenheim Fellowship in Computer Science. He received the Allen Newell Award for Research Excellence (2025). He is an elected Fellow of the American Association for the Advancement of Science, the American Institute for Medical and Biological Engineering, the International Society for Computational Biology, and the Association for Computing Machinery. He leads an NIH 4D Nucleome Center to develop machine learning algorithms to better understand the cell nucleus. He served as the Program Chair for RECOMB 2024. He is also a member of the Scientific Advisory Board of the Chan Zuckerberg Biohub Chicago (CZ Biohub Chicago) and the RECOMB Steering Committee. In 2024, he launched the Center for AI-Driven Biomedical Research (AI4BIO) at CMU, which will be a catalyst for innovations at the intersection of AI and biomedicine across the School of Computer Science and campus. 
== Selected Recent Publications ==
* Chen V#, Yang M#, Cui W, Kim JS, Talwalkar A, and Ma J. Applying interpretable machine learning in computational biology - pitfalls, recommendations and opportunities for new developments. Nature Methods, 21(8):1454-1461, 2024.
* Xiong K#, Zhang R#, and Ma J. scGHOST: Identifying single-cell 3D genome subcompartments. Nature Methods, 21(5):814-822, 2024.
* Zhou T, Zhang R, Jia D, Doty RT, Munday AD, Gao D, Xin L, Abkowitz JL, Duan Z, and Ma J. GAGE-seq concurrently profiles multiscale 3D genome organization and gene expression in single cells. Nature Genetics, 56(8):1701-1711, 2024.
* Zhang Y, Boninsegna L, Yang M, Misteli T, Alber F, and Ma J. Computational methods for analysing multiscale 3D genome organization. Nature Reviews Genetics, 25(2):123-141, 2024.
* Chidester B#, Zhou T#, Alam S, and Ma J. SPICEMIX enables integrative single-cell spatial modeling of cell identity. Nature Genetics, 55(1):78-88, 2023. [Cover Article]
* Zhang R#, Zhou T#, and Ma J. Ultrafast and interpretable single-cell 3D genome analysis with Fast-Higashi. Cell Systems, 13(10):P798-807.E6, 2022. [Cover Article]
* Zhu X#, Zhang Y#, Wang Y, Tian D, Belmont AS, Swedlow JR, and Ma J. Nucleome Browser: An integrative and multimodal data navigation platform for 4D Nucleome. Nature Methods, 19(8):911-913, 2022.
* Zhang R, Zhou T, and Ma J. Multiscale and integrative single-cell Hi-C analysis with Higashi. Nature Biotechnology, 40:254–261, 2022.

30 Boxes

30 Boxes is a minimalist calendaring iOS application created by 83 Degrees. It originated as a web application in March 2006 and was founded by Webshots cofounder Narendra Rocherolle. The website shut down some time in 2020, and the service relaunched as an iOS application in February 2021. The original website was tailored towards "social media junkies".

== Reception ==
Barry Collins of The Sunday Times appreciated the website's plain-language feature for adding events, but disliked being unable to see more than one month of events at a time. Collins was also unhappy that the website could not warn him when two events were scheduled at the same time. In a list of the best web-based calendar software for small businesses, Forbes ranked 30 Boxes second, after Google Calendar. It described 30 Boxes as like “buying a new car with manual transmission and lots of extras - you don't just want to drive it, you want to fool around with it to see what it can do”.

Vector-field consistency

Vector-Field Consistency is a consistency model for replicated data (for example, objects). It was initially described in a paper that was awarded the best-paper prize at the ACM/IFIP/Usenix Middleware Conference 2007, and was enhanced for increased scalability and fault tolerance in a subsequent paper.

== Description ==
This consistency model was initially designed for replicated data management in ad hoc gaming, in order to minimize bandwidth usage without sacrificing playability. Intuitively, it captures the notion that players require, wish for, and take advantage of information about the whole game world (as opposed to the restricted view of rooms, arenas, etc. of limited size employed in many multiplayer video games), but need that information with greater freshness, frequency, and accuracy the closer other game entities are to the player's position.

The model prescribes a multidimensional divergence bounding scheme based on a vector field. The field assigns to every space coordinate in the game scenario or world a consistency vector k=(θ,σ,ν), whose components bound time (the maximum allowed replica staleness), sequence (the maximum number of missing updates), and value (the maximum user-defined measure of replica divergence). The consistency vector fields emanate from field generators designated as pivots (for example, players), and field intensity attenuates as distance from these pivots grows, in concentric or square-like regions.

This consistency model unifies locality-awareness techniques employed in message routing and consistency enforcement for multiplayer games with divergence bounding techniques traditionally employed in replicated database and web scenarios.
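The following Python sketch illustrates how a consistency vector might be looked up from an object's distance to a pivot and then used to decide whether a replica needs refreshing. It is an illustration only, not the published protocol: the zone radii, the bound values, and the names `ConsistencyVector`, `ZONES`, `vector_at`, and `must_refresh` are all assumptions made for this example, and concentric (Euclidean) zones are assumed.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ConsistencyVector:
    theta: float  # time: maximum replica staleness, in seconds
    sigma: int    # sequence: maximum number of missing updates
    nu: float     # value: maximum user-defined divergence


# Hypothetical concentric zones around a pivot: (outer radius, bounds).
# Bounds loosen (field intensity attenuates) as distance grows.
ZONES = [
    (10.0, ConsistencyVector(theta=0.1, sigma=0, nu=0.0)),
    (50.0, ConsistencyVector(theta=1.0, sigma=5, nu=0.2)),
    (math.inf, ConsistencyVector(theta=5.0, sigma=20, nu=0.5)),
]


def vector_at(pivot, position, zones=ZONES):
    """Return the consistency vector that applies at `position`."""
    distance = math.dist(pivot, position)
    for radius, vector in zones:
        if distance <= radius:
            return vector
    return zones[-1][1]


def must_refresh(vector, staleness, missed_updates, divergence):
    """A replica is refreshed as soon as any one bound is exceeded."""
    return (
        staleness > vector.theta
        or missed_updates > vector.sigma
        or divergence > vector.nu
    )


# An object 50 units from the pivot tolerates up to 5 missed updates.
far = vector_at((0.0, 0.0), (30.0, 40.0))
print(must_refresh(far, staleness=0.5, missed_updates=6, divergence=0.0))
```

Objects near the pivot fall in the innermost zone and are refreshed on almost every change, while distant objects are updated far less often, which is where the bandwidth saving comes from.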

Kunstweg

Bürgi's Kunstweg is a set of algorithms developed by Jost Bürgi in the late 16th century. They are used to calculate sines to arbitrary precision. Bürgi used these algorithms to calculate a Canon Sinuum, a sine table in increments of 2 arc seconds. It is believed that the table featured values accurate to eight sexagesimal places. Some authors have speculated that the table only covered the range from 0° to 45°, although there is no evidence supporting this claim. Such tables were crucial for maritime navigation. Johannes Kepler described the Canon Sinuum as the most precise sine table known at the time. Bürgi explained his algorithms in his work Fundamentum Astronomiae, which he presented to Emperor Rudolf II in 1592.

The Kunstweg algorithm calculates sine values iteratively. In each step, the value of a cell is the sum of the two preceding cells in the same column. The final cell's value is halved before beginning the next iteration. Ultimately, the values in the last column are normalized. Accurate sine approximations are achieved after only a few iterations. In 2015, Menso Folkerts and coworkers demonstrated that this iterative process does indeed converge toward the true sine values. According to them, this was the first step towards differential calculus.
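The iteration can be sketched in Python as follows. This is one reading of the procedure, not a transcription of Bürgi's table layout: it assumes the right angle is divided into n equal steps, that each pass halves the last cell, sums upward to form an intermediate column, and then sums downward to form the next column of sine approximations. The function names and the starting column are assumptions made for this example.

```python
import math


def kunstweg_step(column):
    """One pass of the iteration.

    column[j] approximates sin((j + 1) * 90° / n) up to a common scale factor.
    """
    n = len(column)
    # Intermediate column: start from half of the last cell and sum upward.
    diffs = [0.0] * n
    diffs[n - 1] = column[n - 1] / 2
    for j in range(n - 2, -1, -1):
        diffs[j] = diffs[j + 1] + column[j]
    # Next column: running totals of the intermediate column, from sin(0) = 0.
    new_column = []
    total = 0.0
    for d in diffs:
        total += d
        new_column.append(total)
    return new_column


def kunstweg_sines(n, iterations):
    """Approximate sin(j * 90° / n) for j = 1..n."""
    # Crude starting guess (assumption): any positive increasing column works.
    column = [float(j) for j in range(1, n + 1)]
    for _ in range(iterations):
        column = kunstweg_step(column)
    # Normalise so that the last entry, sin(90°), equals 1.
    return [value / column[-1] for value in column]


# Compare with the library sine for 10° steps.
for j, approx in enumerate(kunstweg_sines(9, 10), start=1):
    print(j * 10, approx, math.sin(math.radians(j * 10)))
```

Under this reading, the pass works because exact sines at equally spaced angles have second differences proportional to the sines themselves, so repeated summation amplifies the sine-shaped component of any starting column faster than the others.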

Knowledge organization

Knowledge organization (KO), organization of knowledge, organization of information, or information organization is an intellectual discipline concerned with activities such as document description, indexing, and classification that serve to provide systems of representation and order for knowledge and information objects. According to The Organization of Information by Joudrey and Taylor, information organization:

"examines the activities carried out and tools used by people who work in places that accumulate information resources (e.g., books, maps, documents, datasets, images) for the use of humankind, both immediately and for posterity. It discusses the processes that are in place to make resources findable, whether someone is searching for a single known item or is browsing through hundreds of resources just hoping to discover something useful. Information organization supports a myriad of information-seeking scenarios."

Issues related to knowledge sharing have long been an important part of knowledge management. Knowledge sharing has received a lot of attention in research and business practice, both within and outside organizations and at their different levels. Sharing knowledge is not only about giving it to others; it also includes searching for, locating, and absorbing knowledge. Unawareness of employees' work and duties tends to provoke the repetition of mistakes, the waste of resources, and the duplication of projects. Motivating co-workers to share their knowledge is called knowledge enabling. It leads to trust among individuals and encourages a more open and proactive relationship that eases the exchange of information.

Knowledge sharing is part of the three-phase knowledge management process, which is a continuous process model. The three parts are knowledge creation, knowledge implementation, and knowledge sharing. The process is continuous, which is why the parts cannot be fully separated.
Knowledge creation is the consequence of individuals' minds, interactions, and activities; it refers to the development of new ideas and arrangements. Knowledge implementation means using the knowledge present in a company in the most effective manner. Knowledge sharing, the most essential part of the process for our topic, takes place when two or more people benefit by learning from each other.

Traditional human-based approaches performed by librarians, archivists, and subject specialists are increasingly challenged by computational (big data) algorithmic techniques. KO as a field of study is concerned with the nature and quality of such knowledge-organizing processes (KOP) (such as taxonomy and ontology) as well as the resulting knowledge organizing systems (KOS).

== Theoretical approaches ==
=== Traditional approaches ===
Among the major figures in the history of KO are Melvil Dewey (1851–1931) and Henry Bliss (1870–1955). Dewey's goal was an efficient way to manage library collections, not an optimal system to support users of libraries. His system was meant to be used in many libraries as a standardized way to manage collections. The first version of this system was created in 1876.

An important characteristic of Henry Bliss's thinking (and that of many contemporary thinkers of KO) was the view that the sciences tend to reflect the order of Nature and that library classification should reflect the order of knowledge as uncovered by science. The implication is that librarians, in order to classify books, should know about scientific developments. This should also be reflected in their education: Again from the standpoint of the higher education of librarians, the teaching of systems of classification ... 
would be perhaps better conducted by including courses in the systematic encyclopedia and methodology of all the sciences, that is to say, outlines which try to summarize the most recent results in the relation to one another in which they are now studied together. ... (Ernest Cushing Richardson, quoted from Bliss, 1935, p. 2)

Among the other principles which may be attributed to the traditional approach to KO are:
* Principle of controlled vocabulary
* Cutter's rule about specificity
* Hulme's principle of literary warrant (1911)
* Principle of organizing from the general to the specific

Today, after more than 100 years of research and development in LIS, the "traditional" approach still has a strong position in KO and in many ways its principles still dominate.

=== Facet analytic approaches ===
The date of the foundation of this approach may be chosen as the publication of S. R. Ranganathan's colon classification in 1933. The approach has been further developed by, in particular, the British Classification Research Group. The best way to explain this approach is probably to explain its analytico-synthetic methodology. The meaning of the term "analysis" is: breaking down each subject into its basic concepts. The meaning of the term "synthesis" is: combining the relevant units and concepts to describe the subject matter of the information package in hand. Given subjects (as they appear in, for example, book titles) are first analyzed into a few common categories, which are termed "facets". Ranganathan proposed his PMEST formula: Personality, Matter, Energy, Space and Time:
* Personality is the distinguishing characteristic of a subject.
* Matter is the physical material of which a subject may be composed.
* Energy is any action that occurs with respect to the subject.
* Space is the geographic component of the location of a subject.
* Time is the period associated with a subject.
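As an illustration of the analytico-synthetic idea, the Python sketch below stores the result of analysis as five named facets and performs synthesis by joining the facets that are present in PMEST order. The class name `FacetedSubject`, the " / " separator, and the example subject are assumptions made for this illustration; they do not reproduce the actual notation of the colon classification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FacetedSubject:
    """Result of analysis: one optional value per PMEST facet."""
    personality: Optional[str] = None
    matter: Optional[str] = None
    energy: Optional[str] = None
    space: Optional[str] = None
    time: Optional[str] = None

    def synthesize(self, separator: str = " / ") -> str:
        """Synthesis: combine the facets that are present, in PMEST order."""
        facets = [self.personality, self.matter, self.energy, self.space, self.time]
        return separator.join(f for f in facets if f is not None)


# Hypothetical title: "The design of wooden furniture in Denmark in the 1950s"
subject = FacetedSubject(
    personality="furniture",
    matter="wood",
    energy="design",
    space="Denmark",
    time="1950s",
)
print(subject.synthesize())  # furniture / wood / design / Denmark / 1950s
```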
=== The information retrieval tradition (IR) ===
Important in the IR tradition have been, among others, the Cranfield experiments, which began in the 1950s, and the TREC experiments (Text Retrieval Conferences), starting in 1992. It was the Cranfield experiments that introduced the measures "recall" and "precision" as evaluation criteria for systems efficiency. The Cranfield experiments found that classification systems like UDC and facet-analytic systems were less efficient than free-text searches or low-level indexing systems ("UNITERM"). The Cranfield I test found, according to Ellis (1996, 3–6), the following results: Although these results have been criticized and questioned, the IR tradition became much more influential while library classification research lost influence. The dominant trend has been to regard only statistical averages. What has largely been neglected is to ask: Are there certain kinds of questions in relation to which other kinds of representation, for example, controlled vocabularies, may improve recall and precision?

=== User-oriented and cognitive views ===
The best way to define this approach is probably by method: Systems based upon user-oriented approaches must specify how the design of a system is made on the basis of empirical studies of users. User studies demonstrated very early that users prefer verbal search systems to systems based on classification notations. This is one example of a principle derived from empirical studies of users. Adherents of classification notations may, of course, still have an argument: that notations are well-defined and that users may miss important information by not considering them. Folksonomies are a recent kind of KO based on users' rather than on librarians' or subject specialists' indexing.
=== Bibliometric approaches ===
These approaches are primarily based on using bibliographical references to organize networks of papers, mainly by bibliographic coupling (introduced by Kessler 1963) or co-citation analysis (independently suggested by Marshakova 1973 and Small 1973). In recent years it has become a popular activity to construe bibliometric maps as structures of research fields. Two considerations are important in considering bibliometric approaches to KO:
* The level of indexing depth is partly determined by the number of terms assigned to each document. In citation indexing this corresponds to the number of references in a given paper. On average, scientific papers contain 10–15 references, which provide quite a high level of depth.
* The references, which function as access points, are provided by the highest subject expertise: the experts writing in the leading journals. This expertise is much higher than that which library catalogs or bibliographical databases typically are able to draw on.

=== The domain analytic approach ===
Domain analysis is a sociological-epistemological standpoint that advocates that the indexing of a given document should reflect the needs of a given group of users or a given ideal purpose. In other words, any description or representation of a given document is more or less suited to the fulfillment of certain tasks. A description is never objective or neutral, and the goal is not to standardize descriptions or make one description once and for all for different target groups. The develo

MLOps

MLOps or ML Ops is a paradigm that aims to deploy and maintain machine learning models in production reliably and efficiently. It bridges the gap between machine learning development and production operations, ensuring that models are robust, scalable, and aligned with business goals. The word is a compound of "machine learning" and the continuous delivery practice (CI/CD) of DevOps in the software field. Machine learning models are tested and developed in isolated experimental systems. When an algorithm is ready to be launched, MLOps is practiced between data scientists, DevOps, and machine learning engineers to transition the algorithm to production systems. Similar to DevOps or DataOps approaches, MLOps seeks to increase automation and improve the quality of production models, while also focusing on business and regulatory requirements. While MLOps started as a set of best practices, it is slowly evolving into an independent approach to ML lifecycle management. MLOps applies to the entire lifecycle, from integrating with model generation (software development lifecycle, continuous integration/continuous delivery), orchestration, and deployment, to health, diagnostics, governance, and business metrics.

== Definition ==
MLOps is a paradigm, including aspects like best practices, sets of concepts, as well as a development culture when it comes to the end-to-end conceptualization, implementation, monitoring, deployment, and scalability of machine learning products. Most of all, it is an engineering practice that leverages three contributing disciplines: machine learning, software engineering (especially DevOps), and data engineering. MLOps is aimed at productionizing machine learning systems by bridging the gap between development (Dev) and operations (Ops). 
Essentially, MLOps aims to facilitate the creation of machine learning products by leveraging these principles: CI/CD automation, workflow orchestration, reproducibility; versioning of data, model, and code; collaboration; continuous ML training and evaluation; ML metadata tracking and logging; continuous monitoring; and feedback loops.

== History ==
Interest in operationalizing machine learning systems began to grow in the mid-2010s as ML projects started moving from experimentation to production use. The challenges associated with sustaining such systems were highlighted in a 2015 paper. The predicted growth in machine learning included an estimated doubling of ML pilots and implementations from 2017 to 2018, and again from 2018 to 2020. Reports show that a majority (up to 88%) of corporate machine learning initiatives struggle to move beyond test stages. However, organizations that did put machine learning into production saw profit margin increases of 3–15%. The MLOps market size was USD 2,191.8 million in 2024 and is projected to be USD 16,613.4 million in 2030.

== Architecture ==
Machine learning systems can be grouped into eight categories: data collection, data processing, feature engineering, data labeling, model design, model training and optimization, endpoint deployment, and endpoint monitoring. Each step in the machine learning lifecycle is built in its own system, but requires interconnection. These are the minimum systems that enterprises need to scale machine learning within their organization.
== Goals ==
There are a number of goals enterprises want to achieve by using MLOps systems to implement ML successfully across the enterprise, including:
* Deployment and automation
* Reproducibility of models and predictions
* Diagnostics
* Governance and regulatory compliance
* Scalability
* Collaboration
* Business uses
* Monitoring and management

A standard practice, such as MLOps, takes into account each of the aforementioned areas, which can help enterprises optimize workflows and avoid issues during implementation. Vendors such as Adaptive ML deliver commercial reinforcement learning operations (RLOps) and MLOps infrastructure, targeting organizations deploying large language models in production.

A common architecture of an MLOps system would include data science platforms where models are constructed and the analytical engines where computations are performed, with the MLOps tool orchestrating the movement of machine learning models, data and outcomes between the systems.
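To make the reproducibility and metadata-tracking goals more concrete, the Python sketch below records, for each training run, fingerprints of the data and parameters together with a code version and the resulting metrics. It is a toy in-memory illustration under stated assumptions, not the interface of any real MLOps product: `fingerprint`, `ModelRegistry`, `log_run`, and `find_reproductions` are hypothetical names, and the data is assumed to be JSON-serializable.

```python
import hashlib
import json
from dataclasses import dataclass, field


def fingerprint(obj) -> str:
    """Stable SHA-256 digest of a JSON-serializable object."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


@dataclass
class ModelRegistry:
    """Toy in-memory store of training-run metadata (hypothetical)."""
    runs: list = field(default_factory=list)

    def log_run(self, data, params, code_version, metrics):
        """Record what went into a run and what came out of it."""
        record = {
            "data_hash": fingerprint(data),
            "params_hash": fingerprint(params),
            "code_version": code_version,
            "metrics": metrics,
        }
        self.runs.append(record)
        return record

    def find_reproductions(self, record):
        """Return all runs with identical data, parameters, and code."""
        keys = ("data_hash", "params_hash", "code_version")
        return [r for r in self.runs if all(r[k] == record[k] for k in keys)]


registry = ModelRegistry()
first = registry.log_run([1, 2, 3], {"lr": 0.1}, "v1", {"accuracy": 0.90})
rerun = registry.log_run([1, 2, 3], {"lr": 0.1}, "v1", {"accuracy": 0.90})
changed = registry.log_run([1, 2, 4], {"lr": 0.1}, "v1", {"accuracy": 0.85})
print(len(registry.find_reproductions(first)))  # 2
```

Versioning data, parameters, and code together is what allows two runs to be compared: if the fingerprints match but the metrics differ, the training process itself is not reproducible.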

Run-time algorithm specialization

In computer science, run-time algorithm specialization is a methodology for creating efficient algorithms for costly computation tasks of certain kinds. The methodology originates in the field of automated theorem proving and, more specifically, in the Vampire theorem prover project.

The idea is inspired by the use of partial evaluation in optimising program translation. Many core operations in theorem provers exhibit the following pattern. Suppose that we need to execute some algorithm $\mathit{alg}(A,B)$ in a situation where a value of $A$ is fixed for potentially many different values of $B$. In order to do this efficiently, we can try to find a specialization of $\mathit{alg}$ for every fixed $A$, i.e., such an algorithm $\mathit{alg}_A$ that executing $\mathit{alg}_A(B)$ is equivalent to executing $\mathit{alg}(A,B)$.

The specialized algorithm may be more efficient than the generic one, since it can exploit some particular properties of the fixed value $A$. Typically, $\mathit{alg}_A(B)$ can avoid some operations that $\mathit{alg}(A,B)$ would have to perform, if they are known to be redundant for this particular parameter $A$. In particular, we can often identify some tests that are true or false for $A$, unroll loops and recursion, etc.

== Difference from partial evaluation ==
The key difference between run-time specialization and partial evaluation is that the values of $A$ on which $\mathit{alg}$ is specialized are not known statically, so the specialization takes place at run-time.

There is also an important technical difference. Partial evaluation is applied to algorithms explicitly represented as codes in some programming language. At run-time, we do not need any concrete representation of $\mathit{alg}$. We only have to imagine $\mathit{alg}$ when we program the specialization procedure. All we need is a concrete representation of the specialized version $\mathit{alg}_A$. This also means that we cannot use any universal methods for specializing algorithms, which is usually the case with partial evaluation. Instead, we have to program a specialization procedure for every particular algorithm $\mathit{alg}$. An important advantage of doing so is that we can use some powerful ad hoc tricks exploiting peculiarities of $\mathit{alg}$ and the representation of $A$ and $B$, which are beyond the reach of any universal specialization methods.

== Specialization with compilation ==
The specialized algorithm has to be represented in a form that can be interpreted. In many situations, usually when $\mathit{alg}_A(B)$ is to be computed on many values of $B$ in a row, $\mathit{alg}_A$ can be written as machine code instructions for a special abstract machine, and it is typically said that $A$ is compiled. The code itself can then be additionally optimized by answer-preserving transformations that rely only on the semantics of instructions of the abstract machine.

The instructions of the abstract machine can usually be represented as records. One field of such a record, an instruction identifier (or instruction tag), would identify the instruction type, e.g. an integer field may be used, with particular integer values corresponding to particular instructions. Other fields may be used for storing additional parameters of the instruction, e.g. a pointer field may point to another instruction representing a label, if the semantics of the instruction require a jump. All instructions of the code can be stored in a traversable data structure such as an array, linked list, or tree.

Interpretation (or execution) proceeds by fetching instructions in some order, identifying their type, and executing the actions associated with said type. In many programming languages, such as C and C++, a simple switch statement may be used to associate actions with different instruction identifiers. Modern compilers usually compile a switch statement with constant (e.g. integer) labels from a narrow range by storing the address of the statement corresponding to a value $i$ in the $i$-th cell of a special array, as a means of efficient optimisation. This can be exploited by taking values for instruction identifiers from a small interval of values.

== Data-and-algorithm specialization ==
There are situations when many instances of $A$ are intended for long-term storage and the calls of $\mathit{alg}(A,B)$ occur with different $B$ in an unpredictable order. For example, we may have to check $\mathit{alg}(A_1,B_1)$ first, then $\mathit{alg}(A_2,B_2)$, then $\mathit{alg}(A_1,B_3)$, and so on. In such circumstances, full-scale specialization with compilation may not be suitable due to excessive memory usage. However, we can sometimes find a compact specialized representation $A'$ for every $A$ that can be stored with, or instead of, $A$. We also define a variant $\mathit{alg}'$ that works on this representation, and any call to $\mathit{alg}(A,B)$ is replaced by $\mathit{alg}'(A',B)$, intended to do the same job faster.
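As a small illustration of specialization with compilation, the Python sketch below uses a toy problem that is an assumption of this example rather than anything taken from a theorem prover: $\mathit{alg}(A,B)$ checks whether a pattern $A$ (where `?` matches any single character) matches a string $B$. The pattern is compiled once into a list of tagged instruction records, and the list is then interpreted for each $B$. The names `Op`, `Instr`, `generic_match`, `specialize`, and `run` are hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum


class Op(IntEnum):
    """Instruction tags, taken from a small range of integers."""
    CHECK_LEN = 0
    CHECK_CHAR = 1
    SUCCESS = 2


@dataclass(frozen=True)
class Instr:
    tag: Op
    arg: int = 0   # expected length, or position to inspect
    ch: str = ""   # expected character for CHECK_CHAR


def generic_match(pattern: str, text: str) -> bool:
    """alg(A, B): tests every pattern position on every call."""
    if len(pattern) != len(text):
        return False
    for p, t in zip(pattern, text):
        if p != "?" and p != t:
            return False
    return True


def specialize(pattern: str) -> list:
    """Compile A once; wildcard positions produce no instruction at all."""
    code = [Instr(Op.CHECK_LEN, arg=len(pattern))]
    for i, p in enumerate(pattern):
        if p != "?":
            code.append(Instr(Op.CHECK_CHAR, arg=i, ch=p))
    code.append(Instr(Op.SUCCESS))
    return code


def run(code: list, text: str) -> bool:
    """alg_A(B): fetch each instruction, dispatch on its tag, act."""
    for instr in code:
        if instr.tag == Op.CHECK_LEN:
            if len(text) != instr.arg:
                return False
        elif instr.tag == Op.CHECK_CHAR:
            if text[instr.arg] != instr.ch:
                return False
        elif instr.tag == Op.SUCCESS:
            return True
    return False


code = specialize("a??d")            # A is compiled once
for text in ["abcd", "axyd", "abce", "abc"]:
    print(text, run(code, text))     # and reused for many B
```

The tests that are known to be redundant for this particular $A$ (the wildcard positions) are eliminated at compile time, so the interpreter performs only the checks that can actually fail. In C or C++ the `if`/`elif` chain would typically be a `switch` on the integer tag.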