Richard Diehl Martinez

I am a third-year Computer Science Ph.D. student and Gates Scholar at the University of Cambridge. Previously, I worked as an Applied Research Scientist at Amazon Alexa, focusing on language modeling research. I have an M.S. in Computer Science and a B.S. in Management Science from Stanford University.

My interests lie at the intersection of machine learning, linguistics, and neuroscience. If you're curious, check out some of my papers.

I also publish a bi-weekly NLP newsletter on Substack.

Select Publications

CLIMB – Curriculum Learning for Infant-inspired Model Building, CoNLL 2023 (Best Paper)

We describe our team’s contribution to the STRICT-SMALL track of the BabyLM Challenge (Warstadt et al., 2023). The challenge requires training a language model from scratch using only a relatively small training dataset of ten million words. We experiment with three variants of cognitively-motivated curriculum learning and analyze their effect on the performance of the model on linguistic evaluation tasks. In the vocabulary curriculum, we analyze methods for constraining the vocabulary in the early stages of training to simulate cognitively more plausible learning curves. In the data curriculum experiments, we vary the order of the training instances based on i) infant-inspired expectations and ii) the learning behaviour of the model. In the objective curriculum, we explore different variations of combining the conventional masked language modelling task with a more coarse-grained word class prediction task to reinforce linguistic generalization capabilities. Our results did not yield consistent improvements over our own non-curriculum learning baseline across a range of linguistic benchmarks; however, we do find marginal gains on select tasks. Our analysis highlights key takeaways for specific combinations of tasks and settings which benefit from our proposed curricula. We moreover determine that careful selection of model architecture and training hyper-parameters yields substantial improvements over the default baselines provided by the BabyLM challenge.
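To give a flavor of the vocabulary curriculum idea, here is a minimal sketch: tokens outside a vocabulary cutoff that grows with the training step are mapped to an unknown token. The function names, the linear growth schedule, and all numbers are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical vocabulary-curriculum sketch. Assumes token ids are
# frequency-ranked (lower id = more frequent), so a cutoff keeps only
# the most common words early in training.

def curriculum_vocab_size(step, start=500, full=8000, warmup_steps=10000):
    """Linearly grow the allowed vocabulary from `start` to `full`."""
    if step >= warmup_steps:
        return full
    return start + (full - start) * step // warmup_steps

def apply_vocab_curriculum(token_ids, step, unk_id=0,
                           start=500, full=8000, warmup_steps=10000):
    """Replace tokens above the current cutoff with the <unk> id."""
    cutoff = curriculum_vocab_size(step, start, full, warmup_steps)
    return [t if t < cutoff else unk_id for t in token_ids]
```

Early in training the model sees a heavily restricted vocabulary; by the end of the warm-up it sees the full one, loosely mimicking a learner whose lexicon expands over time.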

Attention-based Contextual Language Model Adaptation for Speech Recognition, ACL 2021

Language modeling (LM) for automatic speech recognition (ASR) does not usually incorporate utterance level contextual information. For some domains like voice assistants, however, additional context, such as the time at which an utterance was spoken, provides a rich input signal. We introduce an attention mechanism for training neural speech recognition language models on both text and non-linguistic contextual data. When applied to a large de-identified dataset of utterances collected by a popular voice assistant platform, our method reduces perplexity by 7.0% relative over a standard LM that does not incorporate contextual information. When evaluated on utterances extracted from the long tail of the dataset, our method improves perplexity by 9.0% relative over a standard LM and by over 2.8% relative when compared to a state-of-the-art model for contextual LM.
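The core mechanism can be sketched as attention from the LM's hidden state over embeddings of non-linguistic context (e.g. time of day), with the pooled context vector fused back into the state. This is a toy, dependency-free illustration under assumed names and an assumed additive fusion, not the paper's architecture.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend_to_context(hidden, context_embeddings):
    """Attention-weighted pool of context embeddings, queried by the
    LM hidden state, then fused back in (simple addition here)."""
    scores = softmax([dot(hidden, c) for c in context_embeddings])
    pooled = [sum(w * c[i] for w, c in zip(scores, context_embeddings))
              for i in range(len(hidden))]
    return [h + p for h, p in zip(hidden, pooled)]
```

Because the attention weights are learned from the hidden state, the model can decide per utterance how much each piece of context should influence the next-word distribution.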

Automatically Neutralizing Subjective Bias in Text, AAAI 2020

Texts like news, encyclopedias, and some social media strive for objectivity. Yet bias in the form of inappropriate subjectivity - introducing attitudes via framing, presupposing truth, and casting doubt - remains ubiquitous. This kind of bias erodes our collective trust and fuels social conflict. To address this issue, we introduce a novel testbed for natural language generation: automatically bringing inappropriately subjective text into a neutral point of view ("neutralizing" biased text). We also offer the first parallel corpus of biased language. The corpus contains 180,000 sentence pairs and originates from Wikipedia edits that removed various framings, presuppositions, and attitudes from biased sentences. Last, we propose two strong encoder-decoder baselines for the task. A straightforward yet opaque CONCURRENT system uses a BERT encoder to identify subjective words as part of the generation process. An interpretable and controllable MODULAR algorithm separates these steps, using (1) a BERT-based classifier to identify problematic words and (2) a novel join embedding through which the classifier can edit the hidden states of the encoder. Large-scale human evaluation across four domains (encyclopedias, news headlines, books, and political speeches) suggests that these algorithms are a first step towards the automatic identification and reduction of bias.
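The two-step MODULAR pipeline can be caricatured as: (1) flag subjective words, (2) edit them out. In this toy sketch the detector is a hand-written word list and the editor simply deletes flagged words, standing in for the paper's BERT classifier and join-embedding decoder; the lexicon and names are illustrative only.

```python
# Toy detect-then-edit pipeline. A real system would use a trained
# classifier (step 1) and a neural editor (step 2).

SUBJECTIVE_LEXICON = {"amazing", "disastrous", "clearly", "so-called"}

def detect_subjective(tokens):
    """Step 1: return indices of words flagged as subjective."""
    return [i for i, t in enumerate(tokens) if t.lower() in SUBJECTIVE_LEXICON]

def neutralize(sentence):
    """Step 2: remove the flagged words from the sentence."""
    tokens = sentence.split()
    flagged = set(detect_subjective(tokens))
    return " ".join(t for i, t in enumerate(tokens) if i not in flagged)
```

Separating detection from editing is what makes the MODULAR approach interpretable: each flagged word can be inspected, and the edit step can be controlled independently.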

Projects
