Ashish Vaswani

Ashish Vaswani

Ashish Vaswani is an Indian computer scientist and entrepreneur. He conducted research at Google Brain, co-founded Adept AI, and, as of 2025, was co-founder and chief executive officer of Essential AI. Vaswani is a co-author of the 2017 paper "Attention Is All You Need", which introduced the Transformer neural network architecture. The Transformer model has been used in the development of subsequent NLP models BERT, ChatGPT, and their successors. == Career == Vaswani completed his engineering in Computer Science from Birla Institute of Technology, Mesra (BIT Mesra) in 2002. In 2004, he enrolled at the University of Southern California for graduate studies. He earned his PhD in Computer Science at the University of Southern California supervised by David Chiang. During his research career at Google, Vaswani was part of the Google Brain team, where he conducted the work leading to the 'Attention Is All You Need' publication. Prior to joining Google, he was affiliated with the Information Sciences Institute at the University of Southern California. After Google, Vaswani co-founded Adept AI, a machine learning-focused startup that developed AI agents and tools for software automation. He has since left the company. He later co-founded Essential AI with Niki Parmar. As of 2025, he was chief executive officer of Essential AI. == Notable works == Vaswani's most notable paper, "Attention Is All You Need", was published in 2017. The paper introduced the Transformer model, which uses self-attention mechanisms instead of recurrence for sequence-to-sequence tasks. The Transformer architecture has become foundational to modern language models and NLP systems, including BERT (2018), GPT-2, GPT-3 (2019–2020) and many more recent models. The "Attention Is All You Need" paper is among the most cited papers in machine learning.

CMU Pronouncing Dictionary

The CMU Pronouncing Dictionary (also known as CMUdict) is an open-source pronouncing dictionary originally created by the Speech Group at Carnegie Mellon University (CMU) for use in speech recognition research. CMUdict provides a mapping orthographic/phonetic for English words in their North American pronunciations. It is commonly used to generate representations for speech recognition (ASR), e.g. the CMU Sphinx system, and speech synthesis (TTS), e.g. the Festival system. CMUdict can be used as a training corpus for building statistical grapheme-to-phoneme (g2p) models that will generate pronunciations for words not yet included in the dictionary. The most recent release is 0.7b; it contains over 134,000 entries. An interactive lookup version is available. == Database format == The database is distributed as a plain text file with one entry to a line in the format "WORD " with a two-space separator between the parts. If multiple pronunciations are available for a word, variants are identified using numbered versions (e.g. WORD(1)). The pronunciation is encoded using a modified form of the ARPABET system, with the addition of stress marks on vowels of levels 0, 1, and 2. A line-initial ;;; token indicates a comment. A derived format, directly suitable for speech recognition engines is also available as part of the distribution; this format collapses stress distinctions (typically not used in ASR). The following is a table of phonemes used by CMU Pronouncing Dictionary. == History == == Applications == The Unifon converter is based on the CMU Pronouncing Dictionary. The Natural Language Toolkit contains an interface to the CMU Pronouncing Dictionary. The Carnegie Mellon Logios tool incorporates the CMU Pronouncing Dictionary. PronunDict, a pronunciation dictionary of American English, uses the CMU Pronouncing Dictionary as its data source. Pronunciation is transcribed in IPA symbols. This dictionary also supports searching by pronunciation. Some singing voice synthesizer software like CeVIO Creative Studio and Synthesizer V uses modified version of CMU Pronouncing Dictionary for synthesizing English singing voices. Transcriber, a tool for the full text phonetic transcription, uses the CMU Pronouncing Dictionary 15.ai, a real-time text-to-speech tool using artificial intelligence, uses the CMU Pronouncing Dictionary

Yaron Singer

Yaron Singer is a computer scientist and entrepreneur whose work has focused on algorithms, machine learning, optimization, and artificial intelligence security. He was the Gordon McKay Professor of Computer Science and Applied Mathematics at Harvard University and co-founded Robust Intelligence, an artificial intelligence security company acquired by Cisco Systems in 2024. == Education == Singer received a PhD in computer science from the University of California, Berkeley under the supervision of Christos Papadimitriou. == Academic career == Singer was a postdoctoral research scientist at Google Research. Singer joined the computer science faculty at Harvard John A. Paulson School of Engineering and Applied Sciences in 2013 and became a full professor in 2019. == Research == Singer's research has focused on algorithms and machine learning, including optimization, algorithmic mechanism design, and adversarial machine learning. His doctoral work studied computational limits in algorithmic mechanism design, including truthful mechanisms and budget-feasible mechanisms. In optimization, Singer co-authored work on submodular optimization and parallel algorithms for large-scale data processing. Singer has also worked on adversarial machine learning, including attacks that use small perturbations or noise to affect the behavior of machine learning systems. == Entrepreneurship == In 2020, Singer co-founded Robust Intelligence Kojin Oshiba. Harvard SEAS reported that the company raised $14 million that year, and TechCrunch reported in 2021 that the company raised a $30 million Series B round led by Tiger Global. The company developed tools for testing AI models and detecting failures before or during deployment. TechCrunch described its RIME product as using an "AI firewall" to stress-test models. In 2024, Cisco Systems acquired Robust Intelligence. CTech reported that Cisco had not disclosed the purchase amount when the acquisition was announced, and later reported the deal value as $400 million. In 2025, Cisco launched Foundation AI, a Cisco team focused on AI for cybersecurity. Techzine reported that Singer led the team and was Cisco's VP of AI and Security. == Recognition == Singer has received a Sloan Research Fellowship, an NSF CAREER Award, a Google Faculty Research Award, and a Facebook Faculty Award. As a graduate student, he received Microsoft Research and Facebook fellowships. In 2012, he received the Best Student Paper Award at the ACM International Conference on Web Search and Data Mining for "How to Win Friends and Influence People, Truthfully: Influence Maximization Mechanisms for Social Networks."

MemoQ

memoQ is a computer-assisted translation software suite which runs on Microsoft Windows operating systems. It is developed by the Hungarian software company memoQ Fordítástechnológiai Zrt. (memoQ Translation Technologies), formerly Kilgray, a provider of translation management software established in 2004 and cited as one of the fastest-growing companies in the translation technology sector in 2012, and 2013. memoQ provides translation memory, terminology, machine translation integration and reference information management in desktop, client/server and web application environments. == History == memoQ, a translation environment tool first released in 2006, was the first product created by memoQ Translation Technologies, a company founded in Hungary by the three language technologists Balázs Kis, István Lengyel and Gábor Ugray. In the years since the software was first presented, it has grown in popularity and is now among the most frequent TEnT applications used for translation (it was rated as the third most used CAT tool in a Proz.com study in 2013 and as the second most widely used tool in a June 2010 survey of 458 working translators), after SDL Trados, Wordfast, Déjà Vu, OmegaT and others. Today it is available in desktop versions for translators (Translator Pro edition), and project managers (Project Manager edition), as well as site-installed and hosted server applications offering integration with the desktop versions and a web browser interface. There are currently several active online forums in which users provide each other with independent advice and support on the software's functions, as well as many online tutorials created by professional trainers and active users. Before its commercial debut, a version of memoQ (2.0) was distributed as postcardware.

Jiliang Tang

Jiliang Tang is a Chinese-born computer scientist and a University Foundation Professor of Computer Science and Engineering at Michigan State University, where he is the director of the Data Science and Engineering (DSE) Lab. His research expertise is in data mining and machine learning. == Education and career == He received his BEng in software engineering (2008) and MSc in computer science (2010) from the Beijing Institute of Technology, Beijing, China. His PhD is from Arizona State University (2015), under the direction of Huan Liu. After gaining his PhD, he worked as a research scientist at Yahoo Labs (2015–16) before joining Michigan State University as an assistant professor (2016). His research has mostly been published jointly with Huan Liu. It has received over thirteen thousand citations documented by Google Scholar, and has received coverage in the media. == Awards == He has received the 2020 ACM SIGKDD Rising Star Award that "aims to celebrate the early accomplishments of the SIGKDD communities' brightest new minds", NSF Career Award, and Michigan State University's Distinguished Withrow Research Award. == Selected publications == === Books === Jiliang Tang, Huan Liu. Trust in Social Media, (Synthesis digital library of engineering and computer science; Synthesis lectures on information security, privacy, and trust, # 13) Morgan & Claypool, 2015 ISBN 9781627054058 === Peer reviewed journal articles === Shu K, Sliva A, Wang S, Tang J, Liu H. Fake news detection on social media: A data mining perspective. ACM SIGKDD explorations newsletter. 2017 Sep 1;19(1):22-36. [1] Tang J, Alelyani S, Liu H. Feature selection for classification: A review. Data classification: Algorithms and applications. 2014:37. [2] Li J, Cheng K, Wang S, Morstatter F, Trevino RP, Tang J, Liu H. Feature selection: A data perspective. ACM Computing Surveys (CSUR). 2017 Dec 6;50(6):1-45. [3] Chang S, Han W, Tang J, Qi GJ, Aggarwal CC, Huang TS. Heterogeneous network embedding via deep architectures. InProceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining 2015 Aug 10 (pp. 119–128) Gao H, Tang J, Hu X, Liu H. Exploring temporal effects for location recommendation on location-based social networks. InProceedings of the 7th ACM conference on Recommender systems 2013 Oct 12 (pp. 93–100). Hu X, Tang J, Gao H, Liu H. Unsupervised sentiment analysis with emotional signals. InProceedings of the 22nd international conference on World Wide Web 2013 May 13 (pp. 607–618).

Depth peeling

In computer graphics, depth peeling is an exact multipass method of order-independent transparency that extracts transparent fragments into depth layers and composites those layers in depth order. Depth peeling has the advantage of being able to generate correct results even for complex images containing intersecting transparent objects. == Method == Depth peeling works by rendering the image multiple times. Depth peeling uses two Z buffers, one that works conventionally, and one that is not modified, and sets the minimum distance at which a fragment can be drawn without being discarded. For each pass, the previous pass' conventional Z-buffer is used as the minimal Z-buffer, so each pass removes already-captured nearer fragments and draws the next depth layer behind them. The resulting images can then be composited in depth order to form a single image. A major drawback of classical depth peeling is performance: it requires one geometry pass per peeled layer, so scenes with high depth complexity require many passes that each re-rasterize the transparent geometry. Later variants reduce the number of passes by peeling multiple layers or both front and back layers in a pass. Dual depth peeling reduces the geometry-pass count from N to N/2+1 by peeling one layer from the front and one from the back in each pass, while multi-layer depth peeling peels several layers per pass and reported up to an 8x speed-up in RGBA8 settings.

Wasserstein GAN

The Wasserstein Generative Adversarial Network (WGAN) is a variant of generative adversarial network (GAN) proposed in 2017 that aims to "improve the stability of learning, get rid of problems like mode collapse, and provide meaningful learning curves useful for debugging and hyperparameter searches". Compared with the original GAN discriminator, the Wasserstein GAN discriminator provides a better learning signal to the generator. This allows the training to be more stable when generator is learning distributions in very high dimensional spaces. == Motivation == === The GAN game === The original GAN method is based on the GAN game, a zero-sum game with 2 players: generator and discriminator. The game is defined over a probability space ( Ω , B , μ r e f ) {\displaystyle (\Omega ,{\mathcal {B}},\mu _{ref})} , The generator's strategy set is the set of all probability measures μ G {\displaystyle \mu _{G}} on ( Ω , B ) {\displaystyle (\Omega ,{\mathcal {B}})} , and the discriminator's strategy set is the set of measurable functions D : Ω → [ 0 , 1 ] {\displaystyle D:\Omega \to [0,1]} . The objective of the game is L ( μ G , D ) := E x ∼ μ r e f [ ln ⁡ D ( x ) ] + E x ∼ μ G [ ln ⁡ ( 1 − D ( x ) ) ] . {\displaystyle L(\mu _{G},D):=\mathbb {E} _{x\sim \mu _{ref}}[\ln D(x)]+\mathbb {E} _{x\sim \mu _{G}}[\ln(1-D(x))].} The generator aims to minimize it, and the discriminator aims to maximize it. A basic theorem of the GAN game states that Repeat the GAN game many times, each time with the generator moving first, and the discriminator moving second. Each time the generator μ G {\displaystyle \mu _{G}} changes, the discriminator must adapt by approaching the ideal D ∗ ( x ) = d μ r e f d ( μ r e f + μ G ) . {\displaystyle D^{}(x)={\frac {d\mu _{ref}}{d(\mu _{ref}+\mu _{G})}}.} Since we are really interested in μ r e f {\displaystyle \mu _{ref}} , the discriminator function D {\displaystyle D} is by itself rather uninteresting. It merely keeps track of the likelihood ratio between the generator distribution and the reference distribution. At equilibrium, the discriminator is just outputting 1 2 {\displaystyle {\frac {1}{2}}} constantly, having given up trying to perceive any difference. Concretely, in the GAN game, let us fix a generator μ G {\displaystyle \mu _{G}} , and improve the discriminator step-by-step, with μ D , t {\displaystyle \mu _{D,t}} being the discriminator at step t {\displaystyle t} . Then we (ideally) have L ( μ G , μ D , 1 ) ≤ L ( μ G , μ D , 2 ) ≤ ⋯ ≤ max μ D L ( μ G , μ D ) = 2 D J S ( μ r e f ‖ μ G ) − 2 ln ⁡ 2 , {\displaystyle L(\mu _{G},\mu _{D,1})\leq L(\mu _{G},\mu _{D,2})\leq \cdots \leq \max _{\mu _{D}}L(\mu _{G},\mu _{D})=2D_{JS}(\mu _{ref}\|\mu _{G})-2\ln 2,} so we see that the discriminator is actually lower-bounding D J S ( μ r e f ‖ μ G ) {\displaystyle D_{JS}(\mu _{ref}\|\mu _{G})} . === Wasserstein distance === Thus, we see that the point of the discriminator is mainly as a critic to provide feedback for the generator, about "how far it is from perfection", where "far" is defined as Jensen–Shannon divergence. Naturally, this brings the possibility of using a different criteria of farness. There are many possible divergences to choose from, such as the f-divergence family, which would give the f-GAN. The Wasserstein GAN is obtained by using the Wasserstein metric, which satisfies a "dual representation theorem" that renders it highly efficient to compute: A proof can be found in the main page on Wasserstein metric. == Definition == By the Kantorovich-Rubenstein duality, the definition of Wasserstein GAN is clear:A Wasserstein GAN game is defined by a probability space ( Ω , B , μ r e f ) {\displaystyle (\Omega ,{\mathcal {B}},\mu _{ref})} , where Ω {\displaystyle \Omega } is a metric space, and a constant K > 0 {\displaystyle K>0} . There are 2 players: generator and discriminator (also called "critic"). The generator's strategy set is the set of all probability measures μ G {\displaystyle \mu _{G}} on ( Ω , B ) {\displaystyle (\Omega ,{\mathcal {B}})} . The discriminator's strategy set is the set of measurable functions of type D : Ω → R {\displaystyle D:\Omega \to \mathbb {R} } with bounded Lipschitz-norm: ‖ D ‖ L ≤ K {\displaystyle \|D\|_{L}\leq K} . The Wasserstein GAN game is a zero-sum game, with objective function L W G A N ( μ G , D ) := E x ∼ μ G [ D ( x ) ] − E x ∼ μ r e f [ D ( x ) ] . {\displaystyle L_{WGAN}(\mu _{G},D):=\mathbb {E} _{x\sim \mu _{G}}[D(x)]-\mathbb {E} _{x\sim \mu _{ref}}[D(x)].} The generator goes first, and the discriminator goes second. The generator aims to minimize the objective, and the discriminator aims to maximize the objective: min μ G max D L W G A N ( μ G , D ) . {\displaystyle \min _{\mu _{G}}\max _{D}L_{WGAN}(\mu _{G},D).} By the Kantorovich-Rubenstein duality, for any generator strategy μ G {\displaystyle \mu _{G}} , the optimal reply by the discriminator is D ∗ {\displaystyle D^{}} , such that L W G A N ( μ G , D ∗ ) = K ⋅ W 1 ( μ G , μ r e f ) . {\displaystyle L_{WGAN}(\mu _{G},D^{})=K\cdot W_{1}(\mu _{G},\mu _{ref}).} Consequently, if the discriminator is good, the generator would be constantly pushed to minimize W 1 ( μ G , μ r e f ) {\displaystyle W_{1}(\mu _{G},\mu _{ref})} , and the optimal strategy for the generator is just μ G = μ r e f {\displaystyle \mu _{G}=\mu _{ref}} , as it should. == Comparison with GAN == In the Wasserstein GAN game, the discriminator provides a better gradient than in the GAN game. Consider for example a game on the real line where both μ G {\displaystyle \mu _{G}} and μ r e f {\displaystyle \mu _{ref}} are Gaussian. Then the optimal Wasserstein critic D W G A N {\displaystyle D_{WGAN}} and the optimal GAN discriminator D {\displaystyle D} are plotted as below: For fixed discriminator, the generator needs to minimize the following objectives: For GAN, E x ∼ μ G [ ln ⁡ ( 1 − D ( x ) ) ] {\displaystyle \mathbb {E} _{x\sim \mu _{G}}[\ln(1-D(x))]} . For Wasserstein GAN, E x ∼ μ G [ D W G A N ( x ) ] {\displaystyle \mathbb {E} _{x\sim \mu _{G}}[D_{WGAN}(x)]} . Let μ G {\displaystyle \mu _{G}} be parametrized by θ {\displaystyle \theta } , then we can perform stochastic gradient descent by using two unbiased estimators of the gradient: ∇ θ E x ∼ μ G [ ln ⁡ ( 1 − D ( x ) ) ] = E x ∼ μ G [ ln ⁡ ( 1 − D ( x ) ) ⋅ ∇ θ ln ⁡ ρ μ G ( x ) ] {\displaystyle \nabla _{\theta }\mathbb {E} _{x\sim \mu _{G}}[\ln(1-D(x))]=\mathbb {E} _{x\sim \mu _{G}}[\ln(1-D(x))\cdot \nabla _{\theta }\ln \rho _{\mu _{G}}(x)]} ∇ θ E x ∼ μ G [ D W G A N ( x ) ] = E x ∼ μ G [ D W G A N ( x ) ⋅ ∇ θ ln ⁡ ρ μ G ( x ) ] {\displaystyle \nabla _{\theta }\mathbb {E} _{x\sim \mu _{G}}[D_{WGAN}(x)]=\mathbb {E} _{x\sim \mu _{G}}[D_{WGAN}(x)\cdot \nabla _{\theta }\ln \rho _{\mu _{G}}(x)]} where we used the reparameterization trick. As shown, the generator in GAN is motivated to let its μ G {\displaystyle \mu _{G}} "slide down the peak" of ln ⁡ ( 1 − D ( x ) ) {\displaystyle \ln(1-D(x))} . Similarly for the generator in Wasserstein GAN. For Wasserstein GAN, D W G A N {\displaystyle D_{WGAN}} has gradient 1 almost everywhere, while for GAN, ln ⁡ ( 1 − D ) {\displaystyle \ln(1-D)} has flat gradient in the middle, and steep gradient elsewhere. As a result, the variance for the estimator in GAN is usually much larger than that in Wasserstein GAN. See also Figure 3 of. The problem with D J S {\displaystyle D_{JS}} is much more severe in actual machine learning situations. Consider training a GAN to generate ImageNet, a collection of photos of size 256-by-256. The space of all such photos is R 256 2 {\displaystyle \mathbb {R} ^{256^{2}}} , and the distribution of ImageNet pictures, μ r e f {\displaystyle \mu _{ref}} , concentrates on a manifold of much lower dimension in it. Consequently, any generator strategy μ G {\displaystyle \mu _{G}} would almost surely be entirely disjoint from μ r e f {\displaystyle \mu _{ref}} , making D J S ( μ G ‖ μ r e f ) = + ∞ {\displaystyle D_{JS}(\mu _{G}\|\mu _{ref})=+\infty } . Thus, a good discriminator can almost perfectly distinguish μ r e f {\displaystyle \mu _{ref}} from μ G {\displaystyle \mu _{G}} , as well as any μ G ′ {\displaystyle \mu _{G}'} close to μ G {\displaystyle \mu _{G}} . Thus, the gradient ∇ μ G L ( μ G , D ) ≈ 0 {\displaystyle \nabla _{\mu _{G}}L(\mu _{G},D)\approx 0} , creating no learning signal for the generator. Detailed theorems can be found in. == Training Wasserstein GANs == Training the generator in Wasserstein GAN is just gradient descent, the same as in GAN (or most deep learning methods), but training the discriminator is different, as the discriminator is now restricted to have bounded Lipschitz norm. There are several methods for this. === Upper-bounding the Lipschitz norm === Let the discriminator function D {\displaystyle D} to be implemented by a multilayer perceptron: D = D n ∘ D n − 1 ∘ ⋯ ∘ D 1 {\displaystyle D=D_{n}\circ D_{n-1}\circ \cdots \circ D_{1}} where D i ( x ) = h ( W i x ) {\displaystyle D_{i}(x)=h(W_