"Correcting Length Bias in Neural Machine Translation" Kenton Murray and David Chiang. WMT 2018 (to appear)

"Incident-Driven Machine Translation and Name Tagging for Low-resource Languages" Ulf Hermjakob et al. Journal of Machine Translation, pp 59-89. June 2018

"Probabilistic Neural Programs" Kenton Murray and Jayant Krishnamurthy. NAMPI 2016

"Auto-Sizing Neural Networks: With Applications to n-gram Language Models" Kenton Murray and David Chiang. EMNLP 2015

"CoBaFi - Collaborative Bayesian Filtering" Alex Beutel, Kenton Murray, Christos Faloutsos, and Alexander J. Smola. WWW 2014

"QCRI at IWSLT 2013: Experiments in Arabic-English and English-Arabic Spoken Language Translation" Hassan Sajjad, Francisco Guzman, Preslav Nakov, Ahmed Abdelali, Kenton Murray, Fahad Al Obaidli, Stephan Vogel. IWSLT 2013

"CMU @ WMT 2013: Syntax, Synthetic Translation Options, and Pseudo-References" Waleed Ammar, Victor Chahuneau, Michael Denkowski, Greg Hanneman, Wang Ling, Austin Matthews, Kenton Murray, Nicola Segall, Yulia Tsvetkov, Alon Lavie and Chris Dyer. WMT 2013

My Master's Thesis at Carnegie Mellon
Master's Thesis, Carnegie Mellon University's Language Technologies Institute.


Anomalous pattern detection is a popular subfield of computer science aimed at detecting anomalous items and groupings of items in a dataset using methods from machine learning, data mining, and statistics. For anomaly detection tasks involving geospatially and temporally labeled data, spatial scan statistics have been successfully applied to numerous spatiotemporal data mining and pattern detection problems, such as predicting crime waves or disease outbreaks [12, 7, 14, 15]. However, spatial scan statistics are limited in that they can scan only over a structured set of data streams. When spatiotemporal data sets contain unstructured free text, spatial scan statistics require the data to be preprocessed into structured categories. Manually labeling and annotating text can be time-consuming or infeasible, while automatic classification methods that assign text fields to a predefined set of event types can obscure the occurrence of novel events (such as a disease outbreak with a previously unseen pattern of symptoms), potentially drowning out the signal of the very outliers the method is attempting to detect.

In this thesis, we propose the Semantic Scan Statistic, which integrates spatial scanning with unsupervised topic modeling to enable timely and accurate detection of novel disease outbreaks. We discuss some of the inherent challenges of working with free-text data in an anomalous pattern detection framework, and we present novel approaches to the problem that adapt topic modeling algorithms specifically to enable anomaly detection. We evaluate our approach using two years of free-text Emergency Department chief complaint data from Allegheny County, PA, demonstrating the efficacy of the Semantic Scan Statistic and the benefits of incorporating unstructured text for spatial event detection. Using semi-synthetic disease outbreaks, a common evaluation method in the disease surveillance field, we show that our approach detects disease outbreaks over 25% faster than current state-of-the-art methods that do not use textual information.

Advisor: Daniel Neill
Committee: Daniel Neill, Chris Dyer, Roni Rosenfeld

My Senior Thesis at Princeton


Latent Dirichlet allocation, or LDA, is a successful generative probabilistic model of text corpora that has performed well on many tasks across Natural Language Processing. Despite being perfectly suited for Automatic Summarization tasks, it has never been applied to them. In this paper, I introduce Summarization by LDA, or SLDA, which better models the subtopics of a document, leading to more pertinent, relevant, and concise summaries than other summarization methods. This new approach is competitive with the leading methods in the field and even outperforms them in many aspects. In addition to SLDA, I introduce a novel, paradigm-shifting evaluation technique for summarization that does not rely on gold standards. It overcomes many of the challenges posed by inherent disagreements among people about what makes a good summary by evaluating over large numbers of people using the commercial service Mechanical Turk. Overall, this paper lays the groundwork for transforming the conventions of the Automatic Summarization field by challenging many definitions.

Advisor: David Blei
Committee: David Blei, Andrea LaPaugh

2012 Presidential Campaign Speeches: A small corpus containing a subset of the campaign speeches delivered by Mitt Romney and Barack Obama during the 2012 Presidential Campaign. It totals 50 speeches and around 120K words, with an approximately equal number of words from each candidate but 31 and 19 speeches, respectively.