Marine Carpuat

Marine Carpuat

Marine Carpuat is a computer scientist who works on machine translation and natural language processing. She is known for her research connecting cross-lingual semantics with machine translation. She has been recognized with a NSF Career Award in 2018, a Google Research award in 2016, and Amazon Faculty Awards in 2016 and 2018. == Education == Marine Carpuat obtained her MPhil and PhD from Hong Kong University of Science and Technology in 2008 under the supervision of Dekai Wu. Her PhD thesis was on the topic of machine translation, and demonstrated the first results showing that explicit modeling of lexical semantics could improve the accuracy of a machine translation system. == Career == After completing her education, Carpuat worked at the National Research Council Canada as a researcher. In 2015, she joined University of Maryland as an assistant professor in Computer Science where she is a member of the CLIP lab. Carpuat works in the area of natural language processing with a focus on machine translation and cross-lingual semantics. She has published over 100 peer-reviewed research papers. Her work is published in the proceedings of computer science conferences, including the Annual Meeting of the Association for Computational Linguistics and Empirical Methods in Natural Language Processing. == Selected honors and distinctions == 2016 Google Research Award 2016, 2018 Amazon Research Awards 2018 NSF Career Award

Automatic acquisition of lexicon

Automatic acquisition of lexicon is a computerized process used for the development of a complex morphological lexicon of a language. The lexicon is essential for the NLP (Natural language processing), as well as a prerequisite to any wide-coverage parser. The two main requirements represent raw corpus and the morphological description of the language. The aim is to provide lemmas that will serve to the explanation of all the words that occur within the corpus. For the achievement of a quality lexicon it is necessary to manually validate the generated lemmas and iterate the whole process several times. The process is focused on the open word classes (e.g. nouns, adjectives, verbs). Closed classes (e.g. prepositions, pronouns, numerals) are excluded. This method is applicable to the languages with a rich morphology, such as Slovak, Russian or Croatian. Applied to Slovak, being an inflectional language, the automatic acquisition focuses on the inflectional morphology as well as on the derivational morphology. This fact enables the users to find out the information about derivational relations (e.g. adjectivizations, prefixes) in the lexicon. For example, Slovak word korpusový is an adjectivization of korpus (eng. corpus). == Three-step loop == Conformably to Benoît Sagot, there are three stages involved in the acquisition of lemmas: Generation and inflection Ranking Manual validation The more iteration will be performed, the more accurate lexicon will be obtained. For each iteration are essential the information given by a manual validator. === Generation and inflection === Firstly, all words which represent the closed word classes (pronouns, prepositions, numerals) are manually excluded from the given corpus. Number of their occurrences in the corpus is provided. Then the automatic generation comes, when the hypothetical lemmas according to the morphological description of a language are created. Generated lemmas are consequently being inflected, so that all of their inflected forms are built. Obtained forms are associated with the corresponding lemma and a morphological tag. === Ranking === There was created a probabilistic model, represented by a fix-point algorithm, to rank the hypothetical lemmas generated in the first step. Best ranked lemmas are expected to be ideally all correct, whereas the least ranked tend to be incorrect. === Manual validation === Correctness of the best- ranked lemmas created in the previous step are checked by the manual validator, who should be a native speaker. Lemmas are at this stage divided into three categories: valid lemmas, appended to lexicon erroneous lemmas generated by valid forms (later associated to another lemmas) erroneous lemmas generated by invalid forms (these need to be excluded) == Future development == Automatic acquisition, in comparison to a purely manual development of the lexicons, seems to be promising, considering the future development, because of the short validation time needed and the relatively small amount of human labor involved.

Software configuration management

Software configuration management (SCM), a.k.a. software change and configuration management (SCCM), is the software engineering practice of tracking and controlling changes to a software system. It is part of the larger cross-disciplinary field of configuration management (CM). SCM includes version control and the establishment of baselines. == Goals == The goals of SCM include: Configuration identification - Identifying configurations, configuration items and baselines. Configuration control - Implementing a controlled change process. This is usually achieved by setting up a change control board whose primary function is to approve or reject all change requests that are sent against any baseline. Configuration status accounting - Recording and reporting all the necessary information on the status of the development process. Configuration auditing - Ensuring that configurations contain all their intended parts and are sound with respect to their specifying documents, including requirements, architectural specifications and user manuals. Build management - Managing the process and tools used for builds. Process management - Ensuring adherence to the organization's development process. Environment management - Managing the software and hardware that host the system. Teamwork - Facilitate team interactions related to the process. Defect tracking - Making sure every defect has traceability back to the source. With the introduction of cloud computing and DevOps the purposes of SCM tools have become merged in some cases. The SCM tools themselves have become virtual appliances that can be instantiated as virtual machines and saved with state and version. The tools can model and manage cloud-based virtual resources, including virtual appliances, storage units, and software bundles. The roles and responsibilities of the actors have become merged as well with developers now being able to dynamically instantiate virtual servers and related resources. == History == == Examples == Ansible – Open-source software platform for remote configuring and managing computers CFEngine – Configuration management software Chef – Configuration management toolPages displaying short descriptions of redirect targets LCFG – Computer configuration management system NixOS – Linux distribution OpenMake Software – DevOps company Otter Puppet – Open source configuration management software Salt – Configuration management software Rex – Open source software

Flat-field correction

Flat-field correction (FFC) is a digital imaging technique to mitigate pixel-to-pixel differences in the photodetector sensitivity and distortions in the optical path. It is a standard calibration procedure in everything from personal digital cameras to large telescopes. == Overview == Flat fielding refers to the process of compensating for different gains and dark currents in a detector. Once a detector has been appropriately flat-fielded, a uniform signal will create a uniform output (hence flat-field). This then means any further signal is due to the phenomenon being detected and not a systematic error. A flat-field image is acquired by imaging a uniformly-illuminated screen, thus producing an image of uniform color and brightness across the frame. For handheld cameras, the screen could be a piece of paper at arm's length, but a telescope will frequently image a clear patch of sky at twilight, when the illumination is uniform and there are few, if any, stars visible. Once the images are acquired, processing can begin. A flat-field consists of two numbers for each pixel, the pixel's gain and its dark current (or dark frame). The pixel's gain is how the amount of signal given by the detector varies as a function of the amount of light (or equivalent). The gain is almost always a linear variable, as such the gain is given simply as the ratio of the input and output signals. The dark-current is the amount of signal given out by the detector when there is no incident light (hence dark frame). In many detectors this can also be a function of time, for example in astronomical telescopes it is common to take a dark-frame of the same time as the planned light exposure. The gain and dark-frame for optical systems can also be established by using a series of neutral density filters to give input/output signal information and applying a least squares fit to obtain the values for the dark current and gain. C = ( R − D ) × m ( F − D ) = ( R − D ) × G {\displaystyle C={\frac {(R-D)\times m}{(F-D)}}=(R-D)\times G} where: C = corrected image R = raw image F = flat field image D = dark frame image m = image-averaged value of (F−D) G = Gain = m ( F − D ) {\displaystyle m \over (F-D)} In this equation, capital letters are 2D matrices, and lowercase letters are scalars. All matrix operations are performed element-by-element. In order for an astrophotographer to capture a light frame, they must place a light source over the imaging instrument's objective lens such that the light source emanates evenly through the users optics. The photographer must then adjust the exposure of their imaging device (charge-coupled device (CCD) or digital single-lens reflex camera (DSLR) ) so that when the histogram of the image is viewed, a peak reaching about 40–70% of the dynamic range (maximum range of pixel values) of the imaging device is seen. The photographer typically takes 15–20 light frames and performs median stacking. Once the desired light frames are acquired, the objective lens is covered so that no light is allowed in, then 15–20 dark frames are taken, each of equal exposure time as a light frame. These are called Dark-Flat frames. == In X-ray imaging == In X-ray imaging, the acquired projection images generally suffer from fixed-pattern noise, which is one of the limiting factors of image quality. It may stem from beam inhomogeneity, gain variations of the detector response due to inhomogeneities in the photon conversion yield, losses in charge transport, charge trapping, or variations in the performance of the readout. Also, the scintillator screen may accumulate dust and/or scratches on its surface, resulting in systematic patterns in every acquired X-ray projection image. In X-ray computed tomography (CT), fixed-pattern noise is known to significantly degrade the achievable spatial resolution and generally leads to ring or band artifacts in the reconstructed images. Fixed pattern noise can be easily removed using flat field correction. In conventional flat field correction, projection images without sample are acquired with and without the X-ray beam turned on, which are referred to as flat fields (F) and dark fields (D). Based on the acquired flat and dark fields, the measured projection images (P) with sample are then normalized to new images (N) according to: N = ( P − D ) ( F − D ) {\displaystyle N={\frac {(P-D)}{(F-D)}}} == Dynamic flat field correction == While conventional flat field correction is an elegant and easy procedure that largely reduces fixed-pattern noise, it heavily relies on the stationarity of the X-ray beam, scintillator response and CCD sensitivity. In practice, however, this assumption is only approximately met. Indeed, detector elements are characterized by intensity dependent, nonlinear response functions and the incident beam often shows time dependent non-uniformities, which render conventional FFC inadequate. In synchrotron X-ray tomography, many factors may cause flat field variations: instability of the bending magnets of the synchrotron, temperature variations due to the water cooling in mirrors and the monochromator, or vibrations of the scintillator and other beamline components. The latter is responsible for the biggest variations in the flat fields. To deal with such variations, a dynamic flat field correction procedure can be employed that estimates a flat field for each individual projection. Through principal component analysis of a set of flat fields, which are acquired prior and/or posterior to the actual scan, eigen flat fields can be computed. A linear combination of the most important eigen flat fields can then be used to individually normalize each X-ray projection: N j = P j − D ¯ F ¯ + ∑ k w j k u k − D ¯ {\displaystyle N_{j}={\frac {P_{j}-{\bar {D}}}{{\bar {F}}+\sum _{k}w_{jk}u_{k}-{\bar {D}}}}} where N j {\displaystyle N_{j}} = intensity normalized X-ray projection P j {\displaystyle P_{j}} = raw X-ray projection F ¯ {\displaystyle {\bar {F}}} = mean flat field image (average of flat fields) u k {\displaystyle u_{k}} = k-th eigen flat field w j k {\displaystyle w_{jk}} = weight of the eigen flat field u k {\displaystyle u_{k}} D ¯ {\displaystyle {\bar {D}}} = mean dark field (average of dark fields)

Distributed file system for cloud

A distributed file system for cloud is a file system that allows many clients to have access to data and supports operations (create, delete, modify, read, write) on that data. Each data file may be partitioned into several parts called chunks. Each chunk may be stored on different remote machines, facilitating the parallel execution of applications. Typically, data is stored in files in a hierarchical tree, where the nodes represent directories. There are several ways to share files in a distributed architecture: each solution must be suitable for a certain type of application, depending on how complex the application is. Meanwhile, the security of the system must be ensured. Confidentiality, availability and integrity are the main keys for a secure system. Users can share computing resources through the Internet thanks to cloud computing which is typically characterized by scalable and elastic resources – such as physical servers, applications and any services that are virtualized and allocated dynamically. Synchronization is required to make sure that all devices are up-to-date. Distributed file systems enable many big, medium, and small enterprises to store and access their remote data as they do local data, facilitating the use of variable resources. == Overview == === History === Today, there are many implementations of distributed file systems. The first file servers were developed by researchers in the 1970s. Sun Microsystem's Network File System became available in the 1980s. Before that, people who wanted to share files used the sneakernet method, physically transporting files on storage media from place to place. Once computer networks started to proliferate, it became obvious that the existing file systems had many limitations and were unsuitable for multi-user environments. Users initially used FTP to share files. FTP first ran on the PDP-10 at the end of 1973. Even with FTP, files needed to be copied from the source computer onto a server and then from the server onto the destination computer. Users were required to know the physical addresses of all computers involved with the file sharing. === Supporting techniques === Modern data centers must support large, heterogenous environments, consisting of large numbers of computers of varying capacities. Cloud computing coordinates the operation of all such systems, with techniques such as data center networking (DCN), the MapReduce framework, which supports data-intensive computing applications in parallel and distributed systems, and virtualization techniques that provide dynamic resource allocation, allowing multiple operating systems to coexist on the same physical server. === Applications === Cloud computing provides large-scale computing thanks to its ability to provide the needed CPU and storage resources to the user with complete transparency. This makes cloud computing particularly suited to support different types of applications that require large-scale distributed processing. This data-intensive computing needs a high performance file system that can share data between virtual machines (VM). Cloud computing dynamically allocates the needed resources, releasing them once a task is finished, requiring users to pay only for needed services, often via a service-level agreement. Cloud computing and cluster computing paradigms are becoming increasingly important to industrial data processing and scientific applications such as astronomy and physics, which frequently require the availability of large numbers of computers to carry out experiments. == Architectures == Most distributed file systems are built on the client-server architecture, but other, decentralized, solutions exist as well. === Client-server architecture === Network File System (NFS) uses a client-server architecture, which allows sharing of files between a number of machines on a network as if they were located locally, providing a standardized view. The NFS protocol allows heterogeneous clients' processes, probably running on different machines and under different operating systems, to access files on a distant server, ignoring the actual location of files. Relying on a single server results in the NFS protocol suffering from potentially low availability and poor scalability. Using multiple servers does not solve the availability problem since each server is working independently. The model of NFS is a remote file service. This model is also called the remote access model, which is in contrast with the upload/download model: Remote access model: Provides transparency, the client has access to a file. He sends requests to the remote file (while the file remains on the server). Upload/download model: The client can access the file only locally. It means that the client has to download the file, make modifications, and upload it again, to be used by others' clients. The file system used by NFS is almost the same as the one used by Unix systems. Files are hierarchically organized into a naming graph in which directories and files are represented by nodes. === Cluster-based architectures === A cluster-based architecture ameliorates some of the issues in client-server architectures, improving the execution of applications in parallel. The technique used here is file-striping: a file is split into multiple chunks, which are "striped" across several storage servers. The goal is to allow access to different parts of a file in parallel. If the application does not benefit from this technique, then it would be more convenient to store different files on different servers. However, when it comes to organizing a distributed file system for large data centers, such as Amazon and Google, that offer services to web clients allowing multiple operations (reading, updating, deleting,...) to a large number of files distributed among a large number of computers, then cluster-based solutions become more beneficial. Note that having a large number of computers may mean more hardware failures. Two of the most widely used distributed file systems (DFS) of this type are the Google File System (GFS) and the Hadoop Distributed File System (HDFS). The file systems of both are implemented by user level processes running on top of a standard operating system (Linux in the case of GFS). ==== Design principles ==== ===== Goals ===== Google File System (GFS) and Hadoop Distributed File System (HDFS) are specifically built for handling batch processing on very large data sets. For that, the following hypotheses must be taken into account: High availability: the cluster can contain thousands of file servers and some of them can be down at any time A server belongs to a rack, a room, a data center, a country, and a continent, in order to precisely identify its geographical location The size of a file can vary from many gigabytes to many terabytes. The file system should be able to support a massive number of files The need to support append operations and allow file contents to be visible even while a file is being written Communication is reliable among working machines: TCP/IP is used with a remote procedure call RPC communication abstraction. TCP allows the client to know almost immediately when there is a problem and a need to make a new connection. ===== Load balancing ===== Load balancing is essential for efficient operation in distributed environments. It means distributing work among different servers, fairly, in order to get more work done in the same amount of time and to serve clients faster. In a system containing N chunkservers in a cloud (N being 1000, 10000, or more), where a certain number of files are stored, each file is split into several parts or chunks of fixed size (for example, 64 megabytes), the load of each chunkserver being proportional to the number of chunks hosted by the server. In a load-balanced cloud, resources can be efficiently used while maximizing the performance of MapReduce-based applications. ===== Load rebalancing ===== In a cloud computing environment, failure is the norm, and chunkservers may be upgraded, replaced, and added to the system. Files can also be dynamically created, deleted, and appended. That leads to load imbalance in a distributed file system, meaning that the file chunks are not distributed equitably between the servers. Distributed file systems in clouds such as GFS and HDFS rely on central or master servers or nodes (Master for GFS and NameNode for HDFS) to manage the metadata and the load balancing. The master rebalances replicas periodically: data must be moved from one DataNode/chunkserver to another if free space on the first server falls below a certain threshold. However, this centralized approach can become a bottleneck for those master servers, if they become unable to manage a large number of file accesses, as it increases their already heavy loads. The load rebalance problem is NP-hard. In order to get a large number of chunkservers to work in collaboration, and to

Anthrobotics

Anthrobotics is the science of developing and studying robots that are either entirely or in some way human-like. The term anthrobotics was originally coined by Mark Rosheim in a paper entitled "Design of An Omnidirectional Arm" presented at the IEEE International Conference on Robotics and Automation, May 13–18, 1990, pp. 2162–2167. Rosheim says he derived the term from "...Anthropomorphic and Robotics to distinguish the new generation of dexterous robots from its simple industrial robot forebears." The word gained wider recognition as a result of its use in the title of Rosheim's subsequent book Robot Evolution: The Development of Anthrobotics, which focussed on facsimiles of human physical and psychological skills and attributes. However, a wider definition of the term anthrobotics has been proposed, in which the meaning is derived from anthropology rather than anthropomorphic. This usage includes robots that respond to input in a human-like fashion, rather than simply mimicking human actions, thus theoretically being able to respond more flexibly or to adapt to unforeseen circumstances. This expanded definition also encompasses robots that are situated in social environments with the ability to respond to those environments appropriately, such as insect robots, robotic pets, and the like. Anthrobotics is now taught at some universities, encouraging students not only to design and build robots for environments beyond current industrial applications, but also to speculate on the future of robotics that are embedded in the world at large, as mobile phones and computers are today. In 2016 philosopher Luis de Miranda created the Anthrobotics Cluster at the University of Edinburgh "a platform of cross-disciplinary research that seeks to investigate some of the biggest questions that will need to be answered" on the relationship between humans, robots and intelligent systems and "a think tank on the social spread of robotics, and also how automation is part of the definition of what humans have always been". to explore the symbiotic relationship between humans and automated protocols.

Cozi

Cozi is a family organization website and mobile app designed to streamline household management. It offers shared calendars, to-do lists, shopping lists, and messaging tools, allowing multiple users to coordinate under one account. Founded in 2005 by former Microsoft employees, Cozi has evolved through acquisitions and now operates under OurFamilyWizard. The app is available in both free and premium versions on iOS, Android, and desktop platforms. == History == Cozi was founded in 2005 by Robbie Cape and Jan Miksovsky, two former Microsoft employees who sought to simplify family logistics with technology. The company's first product, Cozi Central, was released on September 25, 2006, and included a family calendar, shopping lists, family messaging and a photo collage screensaver. The company is based in Seattle, Washington. Cozi has both a freemium version, and a paid version called Cozi Gold. Cozi Gold's additional features include Cozi Contacts, a birthday tracker, more reminders, mobile month view, and change notifications. The software can be used on desktop or mobile applications for iOS and Android. On June 5, 2011, Cozi set a Guinness World Record for the longest line of ducks in a row. The line stretched for one mile and was made up of 17,782 rubber ducks. Cozi was acquired by Time Inc. in 2014. After the Meredith Corporation acquired Time in 2018, Cozi was moved into the Parents Network division. On May 4, 2022, Cozi was acquired by OurFamilyWizard of Minneapolis, Minnesota, reporting more than 20 million registered users.