News
Now
Working on this website, filling in the gaps and experimenting with colors and typography. A bunch of client work keeps getting disrupted by blog post writing, alas.
Previously
Creative Tech Demo Night at betaworks
Building with Claude | An evening with Anthropic
I gave a short demo of Metguessr at AI Tinkerers, with a focus on evals. I got to meet some folks from Anthropic after the event, which was great.
Can LLMs Grade Open Response Reading Questions?
A paper I coauthored was published in the International Journal of AI in Education. It evaluates LLMs at the task of grading student responses to reading comprehension questions. With Owen Henkel, Libby Hills, and Joshua McGrane.
LLM Evals and Benchmarking for Teaching Lab Fellows
I led a workshop for fellows and staff of Teaching Lab Studio. Topics included the role of evals in ML research, static benchmark datasets, dynamic chatbot arenas, perspectives of leading practitioners, and building evals for AI product development.
Can Large Language Models Make the Grade?
A paper I coauthored was accepted as an ACM Learning @ Scale short paper. It evaluates LLMs at grading short answer responses across a variety of K-12 settings. With Owen Henkel, Libby Hills, Adam Boxer, and Zach Levonian.