"Auto-Sizing Neural Networks: With Applications to n-gram Language Models" Kenton Murray and David Chiang. EMNLP 2015

"CoBaFi - Collaborative Bayesian Filtering" Alex Beutel, Kenton Murray, Christos Faloutsos, and Alexander J. Smola. WWW 2014

"QCRI at IWSLT 2013: Experiments in Arabic-English and English-Arabic Spoken Language Translation" Hassan Sajjad, Francisco Guzman, Preslav Nakov, Ahmed Abdelali, Kenton Murray, Fahad Al Obaidli, Stephan Vogel. IWSLT 2013

"CMU @ WMT 2013: Syntax, Synthetic Translation Options, and Pseudo-References" Waleed Ammar, Victor Chahuneau, Michael Denkowski, Greg Hanneman, Wang Ling, Austin Matthews, Kenton Murray, Nicola Segall, Yulia Tsvetkov, Alon Lavie and Chris Dyer. WMT 2013

I am currently investigating ways to improve phrase extraction for Machine Translation, with a specific focus on morphologically rich languages such as Arabic. In particular, I am looking at ways to combine lexical probabilities from various models to get improved performance with more coverage and robustness. Additionally, I am looking at the impact that corpus-level information can have on sentence-level decisions. In other words, if the Arabic phrase "أريد" is frequently translated to "I want" in a corpus, can we better utilize this information?
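As a toy illustration of the corpus-level idea, the sketch below estimates p(target | source) by relative frequency over extracted phrase pairs. This is a minimal sketch, not the method under investigation; the transliterated phrase `uriid` and the counts are made up for illustration.

```python
from collections import Counter, defaultdict

def phrase_translation_probs(phrase_pairs):
    """Estimate p(target | source) by relative frequency over a corpus
    of extracted (source, target) phrase pairs."""
    pair_counts = Counter(phrase_pairs)
    source_counts = Counter(src for src, _ in phrase_pairs)
    probs = defaultdict(dict)
    for (src, tgt), c in pair_counts.items():
        probs[src][tgt] = c / source_counts[src]
    return probs

# Toy corpus: the same source phrase extracted with two different translations.
pairs = [("uriid", "I want")] * 8 + [("uriid", "I would like")] * 2
probs = phrase_translation_probs(pairs)
print(probs["uriid"]["I want"])  # 0.8
```

These corpus-wide relative frequencies are exactly the kind of global signal that could inform a sentence-level translation decision.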

In general, I am interested in statistical machine learning methods for dealing with human language. I find hierarchical graphical models, and in particular, non-parametric Bayesian models, to be a promising line of inquiry for dealing with natural language due to the inherent hierarchical structure in our communications. I have consistently used graphical models throughout many aspects of my research from novel Topic Models for Event Detection in my Master's Thesis, to Gaussian Clustering for Collaborative Filtering, to LDA for Automatic Summarization in my Undergrad Thesis. More detailed descriptions of these projects can be found below.

My Master's Thesis at Carnegie Mellon
Master's thesis, Carnegie Mellon University's Language Technologies Institute.


Anomalous pattern detection is a popular subfield of computer science aimed at detecting anomalous items and groupings of items in a dataset using methods from machine learning, data mining, and statistics. For anomaly detection tasks involving geospatially and temporally labeled data, spatial scan statistics have been successfully applied to numerous spatiotemporal data mining and pattern detection problems, such as predicting crime waves or outbreaks of disease [12, 7, 14, 15]. However, spatial scan statistics can only scan over a structured set of data streams. When spatiotemporal datasets contain unstructured free text, spatial scan statistics require preprocessing the data into structured categories. Manually labeling and annotating text can be time consuming or infeasible, while automatic classification methods that assign text fields to a pre-defined set of event types can obscure the occurrence of novel events - such as a disease outbreak with a previously unseen pattern of symptoms - potentially drowning out the signal of the exact outliers the method is attempting to detect.

In this thesis, we propose the Semantic Scan Statistic, which integrates spatial scanning with unsupervised topic modeling to enable timely and accurate detection of novel disease outbreaks. We discuss some of the inherent challenges of working with free-text data in an anomalous pattern detection framework, and we present novel approaches to the problem that specifically adapt topic modeling algorithms to enable anomaly detection. We evaluate our approach using two years of free-text Emergency Department chief complaint data from Allegheny County, PA, demonstrating the efficacy of the Semantic Scan Statistic and the benefits of incorporating unstructured text for spatial event detection. Using semi-synthetic disease outbreaks, a common evaluation method in the disease surveillance field, we show that outbreaks can be detected over 25% faster than with current state-of-the-art methods that do not use textual information.
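The spatial-scanning half of the approach can be illustrated with an expectation-based Poisson scan statistic, a standard score in the spatial scan literature; this sketch omits the topic-modeling component entirely, and the region names and counts below are hypothetical.

```python
import math

def ebp_score(count, baseline):
    """Expectation-based Poisson log-likelihood ratio: how strongly the
    observed count in a region exceeds its expected baseline."""
    if count <= baseline:
        return 0.0
    return count * math.log(count / baseline) + baseline - count

def scan(regions):
    """Return the region with the highest anomaly score.
    `regions` maps region name -> (observed count, expected baseline)."""
    return max(regions, key=lambda r: ebp_score(*regions[r]))

# Hypothetical daily chief-complaint counts vs. historical baselines.
regions = {"zip_15213": (40, 20), "zip_15217": (22, 20), "zip_15232": (18, 20)}
print(scan(regions))  # zip_15213
```

In the Semantic Scan Statistic, the counts being scanned would come from topic assignments over the free-text complaints rather than from pre-defined syndrome categories.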

Advisor: Daniel Neill
Committee: Daniel Neill, Chris Dyer, Roni Rosenfeld

My senior thesis at Princeton


Latent Dirichlet allocation, or LDA, is a successful generative probabilistic model of text corpora that has performed well across many areas of Natural Language Processing. Despite being well suited to Automatic Summarization tasks, it had not previously been applied to them. In this paper, I introduce Summarization by LDA, or SLDA, which better models the subtopics of a document, leading to more pertinent, relevant, and concise summaries than other summarization methods. This new approach is competitive with the leading methods in the field and even outperforms them in many aspects. In addition to SLDA, I introduce a novel evaluation technique for summarization that does not rely on gold standards. It overcomes many of the challenges posed by people's inherent disagreements about what makes a good summary by evaluating over large numbers of people using the commercial service Mechanical Turk. Overall, this paper lays the groundwork for rethinking many of the conventions of the Automatic Summarization field.
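This is not the SLDA algorithm itself, but a minimal sketch of the general idea of topic-based sentence selection: given per-sentence topic distributions (hard-coded here, as if inferred by LDA), pick the sentences whose topic mixtures are closest to the document's overall mixture.

```python
import math

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def select_sentences(doc_topics, sent_topics, k):
    """Pick the k sentences whose topic distributions are closest (in KL
    divergence) to the document-level topic distribution."""
    ranked = sorted(range(len(sent_topics)),
                    key=lambda i: kl(sent_topics[i], doc_topics))
    return sorted(ranked[:k])

# Toy topic mixtures over 3 topics (illustrative numbers only).
doc = [0.6, 0.3, 0.1]
sents = [[0.9, 0.05, 0.05], [0.55, 0.35, 0.1], [0.1, 0.1, 0.8]]
print(select_sentences(doc, sents, 1))  # [1]
```

Sentence 1's mixture nearly matches the document's, so it is chosen; a real system would also penalize redundancy among the selected sentences.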

Advisor: David Blei
Committee: David Blei, Andrea LaPaugh

Robust Collaborative Bayesian Filtering

Basis for WWW 2014 Paper


Recommendation is a common challenge in computer science and a pervasive problem on the web today. However, many collaborative filtering schemes do not take into account the realities of recommendation on the web. We describe a Bayesian approach to collaborative filtering that incorporates prior knowledge about ratings, is flexible enough to accommodate the discrete and sometimes unusual distributions of ratings, and simultaneously clusters both users and objects for higher accuracy and robustness to spam. We investigate a number of algorithms for fitting our models, including Gibbs sampling, a Hamiltonian Monte Carlo method, and stochastic Fisher scoring using natural gradients. We use these methods to show the advantages of our models on several datasets, including subsets of Netflix, MovieLens, and spam-laden datasets.
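The CoBaFi model itself is considerably richer, but the role of a prior over discrete ratings can be sketched with a simple Dirichlet-multinomial posterior; the 1-5 rating scale and the pseudo-count value below are assumptions for illustration, not the paper's parameters.

```python
def posterior_rating_probs(observed, alpha):
    """Dirichlet-multinomial posterior over discrete rating values 1..5:
    prior pseudo-counts `alpha` smoothed by the observed rating counts."""
    counts = {r: alpha for r in range(1, 6)}
    for r in observed:
        counts[r] += 1
    total = sum(counts.values())
    return {r: c / total for r, c in counts.items()}

# A spam-like user who gives only 5-star ratings: the prior keeps nonzero
# posterior mass on the other rating values instead of collapsing to 5.
probs = posterior_rating_probs([5, 5, 5, 5], alpha=1.0)
print(round(probs[5], 2))  # 0.56
```

This regularizing effect of the prior is one reason a Bayesian treatment helps with robustness to spam: a handful of extreme ratings cannot fully dominate the posterior.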

Joint work with Alex Beutel

Automatic Essay Scoring


Standardized tests are hampered by the manual effort required to score student-written essays. In this paper, we show how linear regression can be used to automatically grade essays on standardized tests. We combine simple, shallow features of the essays, such as character length and word length, with part-of-speech patterns. Our combined model gives a significant reduction in prediction error. We discuss which features were effective in predicting scores.
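A minimal sketch of the regression setup, using a single shallow feature (character length) and hand-rolled ordinary least squares; the lengths and scores are toy numbers, not data from the paper, and the full model combines many such features.

```python
def fit_linear(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Toy data: essay length in characters vs. assigned score.
lengths = [200, 400, 600, 800]
scores = [1.0, 2.0, 3.0, 4.0]
slope, intercept = fit_linear(lengths, scores)
print(round(slope * 500 + intercept, 1))  # 2.5
```

With multiple features (word length, part-of-speech pattern counts, etc.) the same idea extends to multivariate regression solved via the normal equations.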

Joint work with Naoki Orii

Bevara: An open-source Android phone application for field linguistics to aid in corpus collection. Currently still under development, in a private alpha.
2012 Presidential Campaign Speeches: A small corpus of campaign speeches delivered by Mitt Romney and Barack Obama during the 2012 Presidential Campaign. Total size of 50 speeches and around 120K words, with approximately equal numbers of words from each candidate (31 and 19 speeches, respectively).