I get it now. Benchmarks, in the end, are prompts for AI researchers.
If you want a problem solved, translate it into an AGI benchmark.
With enough patience, it becomes something AI researchers report on, optimize for, and ultimately saturate. Months later, the solution arrives; all you had to do was wait. AI researchers are an informal, lossy form of distributed computation - they mass-produce solutions and tools that, almost inevitably, solve the messy problem you started with.
More AGI Final Frontiers:
"Reimplement Sid Meier's Alpha Centauri", but with modern graphics, smart AIs that role-play their personalities, all bugs fixed, a much better endgame, AI-generated unexpected events, and a dev console where you can mod the game via natural language instructions."
"Reimplement all linux command line utilities in Rust, make their names, arguments and options consistent, and fork all software and scripts on the internet to use the new versions."
Let's say we had a ChatGPT-2000 capable of all of this. What would digital life look like? What would people do with their computers?
Even if we were not past a hard takeoff point where AIs could decide for themselves what to work on, the things that would be created in all areas would be incredible.
Consider every time you played a game and thought it would be better if it had x, y, or z. Or you wished an application had this one simple new feature.
All those things would be possible to make. A lot of people will discover why their idea was a bad one. Some will discover their idea was great; some will erroneously think their bad idea is great.
We will be inundated with the creation of those good and bad ideas. Some people will have ideas on how to manage that flood of new creations and will create tools to help out. Some of those tools will be good and some will be bad, and there will be a period of churn where finding the good and ignoring the bad is difficult; a badly made curator might make bad ideas linger.
That's just in the domain of games and applications. If AI could manage that level of complexity, you could ask it to develop and test just about any software idea you have.
I barely go a day without thinking of something that I could spend months of development time on.
Some idle thoughts that such a model could develop and test:
Can you make a transformer that uses geodesics instead of linear-space V modifiers? Is it better? Would it better support scalable V values?
Can you train a model to identify which layer is the likely next layer, purely based upon the input given to that layer? If it only occasionally gets it wrong, does the model perform better if you give the input to the layer that the predictor thought was the next layer? Can you induce looping or skipping of layers this way?
If you train a model with the layers in a round-robin ordering on every input, do the layers regress to a mean generic layer form, or do they develop into a general information improver that works purely off the context of the input?
What if you did every layer on a round robin twice, so that every layer was guaranteed to be followed by any of the other layers at least once?
Given that you can quadruple the parameters of a model without changing its behaviour using the Wn + Randomn, Wn - Randomn trick, can you distill a model to 0.25x its size and then quadruple it, making a model that retains the original size but takes further learning better, broadening parameter use? (A sketch of the trick follows this list.)
Can any of these ideas be combined with the ones above?
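For concreteness, here is a minimal sketch of how the parameter-multiplying trick could work, assuming the averaging variant: replace a weight w with four perturbed copies w+r1, w-r1, w+r2, w-r2 whose mean is exactly w, so the layer's output is unchanged while the parameter count quadruples. Scalars stand in for weight matrices, and the names linear and split-linear are made up for illustration:

    ;; y = w*x, with a scalar w standing in for a weight matrix
    (defn linear [w x] (* w x))

    ;; four perturbed copies of w whose mean is exactly w,
    ;; so the output is identical for any choice of r1 and r2
    (defn split-linear [w r1 r2 x]
      (/ (+ (linear (+ w r1) x)
            (linear (- w r1) x)
            (linear (+ w r2) x)
            (linear (- w r2) x))
         4))

    (linear 3 5)            ;; => 15
    (split-linear 3 7 11 5) ;; => 15, behaviour unchanged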
Imagine if, instead of just having these idle ideas, you could direct an AI to implement them and report the results back to you.
Even if 99.99% of the ideas are failures, there could be massive advances from the fraction that remains.
"Reimplement Linux in Rust" would be a good one!
My "pelican test" for coding LLMs now is creating a proof of concept building UIs (creating a hello world app) using Jetpack Compose in Clojure. Since Compose is implemented as Kotlin compiler extensions and does not provide Java APIs, it cannot be used from Clojure using interop.
I outlined a plan that let it analyze Compose code and suggested it could either reverse-engineer the bytecode of a Kotlin demo app first and emit equivalent bytecode from Clojure, or implement it directly in Clojure based on the analysis. Claude Code with Sonnet 4 was confident about implementing it directly and failed spectacularly.
Then, as a follow-up, I had it compile the Kotlin demo app and tried to bundle those classes using Clojure tooling, to at least make sure it got the dependencies right as a starting point. It resorted to cheating by shelling out to gradlew from Clojure :) I am going to wait for the next round of SOTA models to burn some tokens again.
Doesn't Clojure already support all of those features?
E.g.:
> transducer-first design, laziness either eliminated or opt-in
You can write your code using transducers or opt in to laziness in Clojure now. So it's a matter of choice of tools, rather than a feature of the language.
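Both styles are already in clojure.core; a quick sketch:

    ;; lazy by default: map with a collection returns a lazy seq
    (map inc [1 2 3])                 ;; => (2 3 4), realized on demand

    ;; transducer versions: eager, no intermediate lazy seq
    (into [] (map inc) [1 2 3])       ;; => [2 3 4]
    (transduce (map inc) + 0 [1 2 3]) ;; => 9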
> protocols everywhere as much as practically possible (performance)
Again, it's a choice made by the programmer; the language already allows you to have protocols everywhere. It's also how Clojure is implemented under the hood.
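A minimal sketch of layering a protocol over existing types (Countable and count-of are made-up names for illustration):

    (defprotocol Countable
      (count-of [this] "Number of elements or characters."))

    ;; extend to existing types without touching their source
    (extend-protocol Countable
      clojure.lang.IPersistentVector
      (count-of [v] (count v))
      String
      (count-of [s] (.length s)))

    (count-of [1 2 3]) ;; => 3
    (count-of "abc")   ;; => 3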
> first-class data structures/types are also CRDT data types, where practical (correctness and performance)
Most of the programs I have worked on did not require CRDTs. I'm inclined to reach for a library for this.
> first-class maps, vectors, arrays, sets, counters, and more
Isn't this already the case? If Clojure's native data structures are not enough, there's an ocean of Java options.
Which leads to a very interesting question:
How should the 'real' AGI respond to your request?
> first-class maps, vectors, arrays, sets, counters, and more
That's my mistake; this line was intended to be a sub-bullet point of the previous line regarding CRDTs.
> the language already allows you to have protocols everywhere
The core data structures, for example, are not based on protocols; they are implemented in pure Java. One reason is that the 1.0 version of the language lacked protocols. All that being said, it remains an open question what the full implications of the protocol-first idea are.
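You can check this at the REPL; the core collections satisfy Java interfaces rather than protocols:

    ;; core data structures are built on Java interfaces
    (instance? clojure.lang.IPersistentVector [1 2 3]) ;; => true
    (instance? clojure.lang.IPersistentMap {:a 1})     ;; => true
    ;; protocols only arrived in Clojure 1.2, after these were in place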
> You can write your code using transducers or opt in to laziness in Clojure now. So it's a matter of choice of tools, rather than a feature of the language.
You 100% can. Unfortunately, many people don't. The first thing people learn is (map inc [1 2 3]), which produces a lazy sequence. Clojure would never change this behavior, as the authors value backward compatibility almost above everything else, and rightly so. A transducer-first approach would be a world where (map inc [1 2 3]) produces the vector [2 3 4] by default, for example.
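For illustration, a hypothetical eager map* (the name is made up) behaving the way a transducer-first default might:

    ;; hypothetical: an eager, transducer-backed map that returns a vector
    (defn map* [f coll]
      (into [] (map f) coll))

    (map* inc [1 2 3]) ;; => [2 3 4], fully realized, not lazy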
This was mentioned by Rich Hickey himself in his "A History of Clojure" paper:
https://clojure.org/about/history https://dl.acm.org/doi/pdf/10.1145/3386321
(from paper) > "Clojure is an exercise in tool building and nothing more. I do wish I had thought of some things in a different order, especially transducers. I also wish I had thought of protocols sooner, so that more of Clojure’s abstractions could have been built atop them rather than Java interfaces."
The notion of when a language is created is open to interpretation.
It is not stated whether you want such a language described, specified, or implemented.
This is a good one. Forget AGI; I'd settle for an LLM that, when doing Clojure, doesn't spew hot trash. Balancing parens on tab-complete would be a nice start. Or writing sensible ClojureScript that isn't reskinned JavaScript with parens would be pretty stellar.
Perhaps this is a really great AGI test - not in the sense that the AGI can complete the given task correctly, but whether the AGI can interpret incredibly hand-wavy requirements like "do XXX (as much as possible)" and implement them: A, B, C, etc.