Published: September 1, 2025

If I am reading this correctly, LLMs tend to do fairly well with Elixir across the board and it has the best upper bound of all?

@josevalim I wonder if LLMs are giving us a way to quantify claims about language productivity :)

@josevalim LLMs tend to do well with languages and frameworks that map normal language onto the output code without a ton of implementation code that the average person wouldn't be able to connect to their request. Elixir comes from an era that prized that sort of system.

@josevalim I wonder if it relates to less language design churn and, on average, better-designed codebases (even if there are fewer of them)

@josevalim My theory is Elixir code in training data is generally high quality

@josevalim Elixir for the win! Best upper bound, best non-reasoning, and best reasoning scores! If I’m reading it correctly.

@josevalim I think it is because there is a smaller Elixir corpus so far. And that code is well written, basically because most of it was written by people who have been using the language since the beginning, so they learned a lot of correct patterns. Basically, other languages have a lot of noise.

@josevalim This has been my experience recently switching to Elixir from a fairly mature Go codebase. LLMs in general seem to do great with functional languages, maybe it's a token density thing? 🤔 Either way I'm loving Elixir so far; much easier to design complex systems. Thank you!

@josevalim 1) More restrictive functional languages that create fewer bugs per LOC likely do better; 2) Languages that experienced or great developers tend to use likely do better (trained on better code); 3) Languages that people pick out of passion and not because the bossman

@josevalim Indeed. We built a mass mailer app with Elixir, our first foray into this language, and Claude. It was faster to develop. We love Elixir.

@josevalim Moreover, docs are first-class citizens in Elixir, and the overall quality of open-source libraries is above average. That helps LLMs a lot in getting an idea of what's going on.

@josevalim Casual elixir W

@josevalim The second highest is Racket! Wondering if it's FP at play.

@josevalim Cross-language performance comparisons are invalid, you can only compare models within the same language. Without knowing the relative difficulty of each dataset, it's plausible the Elixir corpus simply contains easier problems.

@josevalim Claude Code performs quite well for us at @operately with Elixir, but unfortunately, I wouldn't be able to say that it performs as well as I see it perform with TypeScript. For example, I frequently find Ruby-ish code generated for Elixir, especially in tests or heavy DSL-like

@josevalim That aligns with my experience. All the latest models do well with Elixir with proper specs and implementation plans.

@josevalim I wonder if this includes macro DSLs; that is where I found the ineffectiveness comes from

@josevalim Watch my elixirconf talk! (I go on a tangent about an LLM test I did)

@josevalim I must confess, I don't use LLMs a lot, but the Elixir code they generate is not pretty: deep nesting, case statements instead of pattern-matched functions, defensive programming...not really "idiomatic"
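To illustrate the contrast this reply describes, here is a minimal sketch; the module and function names are hypothetical, not code from the benchmark:

```elixir
# Non-idiomatic: deep nesting via case expressions.
defmodule Nested do
  def parity(value) do
    case value do
      {:ok, n} ->
        case rem(n, 2) do
          0 -> "even"
          _ -> "odd"
        end

      :error ->
        "error"
    end
  end
end

# Idiomatic: the same logic as pattern-matched function heads.
defmodule Idiomatic do
  def parity({:ok, n}) when rem(n, 2) == 0, do: "even"
  def parity({:ok, _n}), do: "odd"
  def parity(:error), do: "error"
end
```

Both modules behave identically; the second states each accepted shape of input up front in its clause heads, which is the style the reply is contrasting against nested `case`.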

@josevalim I think this benchmark doesn’t have a way of accounting for use-cases where feedback loops with the language tooling (language server, compiler, linters, etc) come into play. Still, very interesting.

@josevalim Guessing the amount of code vs. the quality of the code used in training makes a difference?

@josevalim My (uninformed) theory is that pattern matching and immutability are easier for LLMs to understand.

@josevalim Ghost in the machine respects the ancient magic of Beam.

@josevalim Interesting data
