HarHarVeryFunny 16 hours ago

GPT 4.5 also has a knowledge cutoff date of 10-2023.

https://www.reddit.com/r/singularity/comments/1izpb8t/gpt45_...

I'm guessing that this model finished pre-training at least a year ago (it's been 2 years since GPT-4 was released), and they just didn't see the hoped-for performance gains to think it warranted a release at the time, so they put all their effort into the Q-star/Strawberry (eventual o1) reasoning work instead.

It seems that OpenAI's reasoning model lead perhaps isn't what they thought it was, and the recent slew of strong non-reasoning models (Gemini 2.0 Flash, Grok 3, Sonnet 3.7) made them feel the need to release something themselves for appearances' sake, so they dusted off this model, perhaps did a bit of post-training on it for EQ, and here we are.

The price is a bit of a mystery - perhaps just a reflection of an older model without all the latest efficiency tricks to make it cheaper. Maybe it's dense rather than MoE - who knows.

  • sigmoid10 15 hours ago

    Rumors said that GPT4.5 is an order of magnitude larger. Around 12 trillion parameters total (compared to GPT4's 1.2 trillion). It's almost certainly MoE as well, just a scaled up version. That would explain the cost. OpenAI also said that this is what they originally developed as "Omni" - the model supposed to succeed GPT4 but which fell behind expectations. So they renamed it 4.5 and shoehorned it in to remain in the news among all those competitor releases.

    • glenstein 13 hours ago

      This is all excellent detail. Wondering if there are any good suggestions for further reading on the inside baseball of what happened with GPT-4.5?

      • qeternity 11 hours ago

        Well, it's not...it gets most details wrong.

        • glenstein 10 hours ago

          Can you elaborate?

          • qeternity 10 hours ago

            GPT-4 was rumored to be 1.8T params...not 1.2

            And the successor model was called "Orion", not "Omni".

            • glenstein 10 hours ago

              Appreciate the corrections, but I'm still a bit puzzled. Are they wrong about 4.5 having 12 trillion parameters, about it originally being called Orion (not Omni), or about it being the expected successor to GPT-4? And do you have any related reading that speaks to any of this?

            • az226 7 hours ago

              GPT-4 was 1.3T. 221B active. 2 experts active. 16 experts total.

    • ljlolel 14 hours ago

      the gpt-4o ("omni") is probably a distilled 4.5; hence why there's not much quality difference

      • sigmoid10 14 hours ago

        4o has been out since May last year, while omni (now rechristened as 4.5) only finished training in October/November.

        • cubefox 14 hours ago

          4.5 was called Orion, not Omni.

    • qeternity 11 hours ago

      GPT-4 was rumored to be 1.8T params...not 1.2

      And the successor model was called "Orion", not "Omni".

    • zaptrem 8 hours ago

      You're thinking of "Orion" not "Omni" (GPT 4o stands for "Omni" since it's natively multimodal with image and audio input/output tokens)

    • Leary 14 hours ago

      How does this compare with Grok 3's parameter count? I know Grok 3 was trained on a larger cluster (100k-200k GPUs) but GPT-4.5 used distributed training.

  • glenstein 13 hours ago

    >The price is a bit of a mystery

    I think it at least is somewhat analogous to what happened with pricing on previous models. GPT-4, despite being less capable than 4o, is an order of magnitude more expensive, and comparably expensive to o1. It seems like once the model is out, the price is the price, and the performance gains that emerge come attached to new, smaller variants of previous models.

  • simonw 14 hours ago

    I don't think the October 2023 training cut-off means the model finished pre-training a year ago. All of OpenAI's models share that same cut-off date.

    One theory is that they're worried about the increasing tide of LLM-generated slop that's been posted online since that date. I don't know if I buy that or not - other model providers (such as Anthropic, Gemini) don't seem worried about that.

  • Chance-Device 14 hours ago

    Releasing it was probably a mistake. In context what the model is could have been understood, but they haven’t really presented that context. Also it would be lost on a general audience.

    The general public will naturally expect it to be the next big thing. Wasn’t that the point of releasing it? To seem like progress is being made? To try to make that point with a model that doesn’t deliver is a misstep.

    If I were Sam Altman, I’d be pulling this back before it goes on general release, saying something like it was experimental and after user feedback the costs weren’t worth it and they’re working on something else as a replacement. Then o3 or whatever they actually are working on instead can be the “replacement” even if it’s much later.

    • datadrivenangel 11 hours ago

      or just say it was too good and thus too dangerous to release...

  • bilater 13 hours ago

    I sort of believed this, but 4.5 coming out last year would absolutely have been a big deal compared to what was out there at the time. I just don't understand why they would not have launched it then.

  • numba888 11 hours ago

    > slew of strong non-reasoning models (Gemini 2.0 Flash, Grok 3, Sonnet 3.7)

    Sonnet 3.7 is actually a reasoning model.

    • LaurensBER 10 hours ago

      It's my understanding that reasoning in Sonnet 3.7 is optional and configurable.

      I might be wrong but I couldn't find a source that indicates that the "base" model also implements reasoning.

      • amluto 6 hours ago

        From limited experimentation: Sonnet 3.7 has “extended thinking” as an option, although the UI, at least in the app, leaves something to be desired. It also has a beta feature called “Analysis” that seems to work by having the model output JavaScript code as part of its response that is then run and feeds back into the answer. Both of these abilities are visible — users can see the chain of thought and the analysis code.
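
        For anyone poking at this from the API rather than the app, extended thinking is exposed as a request parameter. A minimal sketch, assuming I have the parameter shape right (the model ID and token budgets here are just illustrative):

          import anthropic

          client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
          response = client.messages.create(
              model="claude-3-7-sonnet-20250219",
              max_tokens=16000,                                   # must exceed the thinking budget
              thinking={"type": "enabled", "budget_tokens": 8000},
              messages=[{"role": "user", "content": "Plan a migration from Postgres 12 to 16."}],
          )
          for block in response.content:
              if block.type == "thinking":
                  print("[thinking]", block.thinking)             # the visible chain of thought
              elif block.type == "text":
                  print(block.text)                               # the final answer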

        It seems, based again on limited experimentation doing sort-of-real work, that analysis works quite well and extended thinking is so-so. Whereas DeepSeek R1 seems to be willing and perhaps even encouraged to second-guess itself (maybe this is a superpower of the "wait" token), Sonnet 3.7 doesn't seem to second-guess itself as much as it should. It will happily extended-think, generate a wrong answer, and then give a better answer after being asked a question that it really should have thought of itself.

        (I’m not complaining. I’ve been a happy user of 3.7 for a whole day! But I think there’s plenty of room for improvement.)

tsunego 16 hours ago

GPT-4.5 feels like OpenAI's way of discovering just how much we'll pay for diminishing returns.

The leap from GPT-4o to 4.5 isn't a leap—it's an expensive tiptoe toward incremental improvements, priced like a luxury item without the luxury payoff.

With pricing at 15x GPT-4o, they're practically daring us not to use it. Given this, I wouldn't be surprised if GPT-4.5 quietly disappears from the API once OpenAI finishes squeezing insights (and cash) out of this experiment.

  • zamadatix 16 hours ago

    Even this is a bit overly complicated/optimistic to me. Why not something as simple as: OpenAI has been building larger and larger models to great success for a long time. As a result, they were excited that this one was going to be so much larger (and therefore so much better) that the price to run it would be well worth the huge jump they were planning to get from it. What really happened is that this method of scaling hit a wall and they were left with an expensive dud they won't get much out of, but they have to release something for now, otherwise they start falling well behind on the leaderboards over the next few months. Meanwhile, they scramble to find other means of scaling, like the gains "chain of thought + runtime compute" provided.

    • hn_throwaway_99 15 hours ago

      Thank you so much for this comment. I don't really understand the need for people to go straight to semi-conspiratorial hypotheses, when the simpler explanation makes so much more sense. All the evidence is that this model is much larger than previous ones, so they must charge a lot more for inference because it costs so much more to run. OpenAI were the OGs when it came to scaling, so it's not surprising they went this route and eventually hit a wall.

      I don't at all blame OpenAI for going down this path (indeed, I laud them for making expensive bets), but I do blame all the quote-unquote "thought leaders" who were writing breathless posts about how AGI was just around the corner because things would just scale linearly forever. It was classic "based on historical data, this 10-year-old will be 20 feet tall by the time he's 30" thinking, and lots of people called them out on this, and they either just ignored it or responded with "oh, simple not-in-the-know peons" dismissiveness.

      • bee_rider 14 hours ago

        It is weird because this is a board for working programmers for the most part. So like, who's seen a grand conspiracy actually be accomplished? Probably not many. A lackluster product that gets released even though it sucks because too many people are highly motivated not to notice that it sucks? Everybody has experienced that, right?

        • glenstein 13 hours ago

          Exactly. Although I wouldn't even say they have blinders; it seems like OpenAI understands quite well what 4.5 can do and what it can't, hence the modesty in their messaging.

          To your point, though, I would add: not only who has seen any grand conspiracy actually be accomplished, but who has seen one even attempted and kept under wraps, such that the absence of corroborating sources was more consistent with an effectively executed conspiracy than with the simple absence of such a plan?

      • danielbln 14 hours ago

        It works until it doesn't and hindsight is 20/20.

        • hn_throwaway_99 13 hours ago

          > It works until it doesn't

          Of course, that's my point. Again, I think it's great that OpenAI swung for the fences. My beef is again with these "thought leaders" who would write this blather about AGI being just around the corner in the most uncritical manner possible (e.g. https://news.ycombinator.com/item?id=40576324). These folks tended to be in one of two buckets:

          1. "AGI cultists" as I called them, the "we're entering a new phase of human evolution"-type people.

          2. People who had a motive to try and sell something.

          And it's not about one side or the other being "right" or "wrong" after the fact, it's that so much of this just sounded like magical thinking and unwarranted extrapolations from the get go. The actual experts in the area, if they were free to be honest, were much, much more cautious in their pronouncements.

          • danielbln 13 hours ago

            Definitely, the grifters and hypesters are always spoiling things, but even with a sober look it felt like AGI _could_ be around the corner. All these novel and somewhat unexpected emerging capabilities as we pushed more data through training, you'd think maybe that's enough? It wasn't and test time compute alone isn't either, but that's also hindsight to a degree.

            Either way, AGI or not, LLMs are pretty magical.

            • snovv_crash 12 hours ago

              If you've been around long enough to witness a previous hype bubble (and we've literally just come out of the crypto bubble), you should really know better by now. Pets.com, literally an online shop selling pet food, IPO'd at around a $300M valuation in early 2000, just before the whole dot-com bubble burst.

              And yeah, LLMs are awesome. But you can't predict scientific discovery, and all future AI capabilities are literally still a research project.

              I've had this on my HN user page since 2017, and it's just as true as ever: In the real world, exponentials are actually early stage sigmoids, or even gaussians.

          • baxtr 12 hours ago

            Well that's only because YOU don't understand exponential growth! No human can! /s

      • Kye 10 hours ago

        In fundamental science terms, it also proves once and for all that more model doesn't mean more better. Any forces within OpenAI pushing to move past just growing the model for gains now have a strong argument for going all-in on new processes.

  • TZubiri 16 hours ago

    Time to enter the tick cycle.

    I ask ChatGPT to give me a map highlighting all Spanish-speaking countries, and it gives me Stable Diffusion trash.

    Just gotta do the grunt work: add a tool with a map API. Integrate with Google Maps for transit stuff.
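
    Something like the sketch below is the whole trick: let the model emit a structured call and render the map yourself. The tool name, schema, and the render_country_map backend are hypothetical, not a real OpenAI or Google Maps API:

      from openai import OpenAI

      client = OpenAI()

      # Hypothetical tool: the model picks the countries, our own code draws the map.
      tools = [{
          "type": "function",
          "function": {
              "name": "render_country_map",
              "description": "Render a world map with the given ISO 3166 country codes highlighted.",
              "parameters": {
                  "type": "object",
                  "properties": {
                      "country_codes": {"type": "array", "items": {"type": "string"}},
                      "color": {"type": "string"},
                  },
                  "required": ["country_codes"],
              },
          },
      }]

      resp = client.chat.completions.create(
          model="gpt-4o",
          messages=[{"role": "user", "content": "Give me a map highlighting all Spanish-speaking countries."}],
          tools=tools,
      )
      print(resp.choices[0].message.tool_calls)  # arguments to hand to the actual map renderer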

    It's a good LLM already; it doesn't need to be Einstein and solve aerospace equations. We just need to wait until they realize their limits and find the humility to build yet another useful product that won't conquer the world.

    • Willingham 16 hours ago

      I've thought of LLMs as Google 2.0 for some time now. Truly a world-changing technology, similar to how Google changed the world, and likely to have an even larger impact than Google had as we create highly specialized implementations of the technology in the coming decade… but it's not energy-positive nuclear fusion, or a polynomial-time NP solver; it's just Google 2.0.

      • dingnuts 15 hours ago

        Google 2.0 where you have to check every answer it gives you because it's authoritative about nothing.

        Works great when the output is small enough to unit test or immediately try in situations with no possible negative outcomes.

        Anything larger? Skip the LLM slop and go to the source. You have to go to the source, anyway.

        • Chilko 10 hours ago

          All while using far more energy than a normal google search

          • glenneroo 10 hours ago

            I keep wondering what the long-game (if any) of LLMs is... to make the world dependent on various models then jack the rates up to cover the costs? The gravy-train of SV funding has to end eventually... right?

        • CamperBob2 14 hours ago

          You have to go to the source, anyway.

          Yeah, and then check that. I don't get this argument at all.

          People who uncritically swallow the first answer or two they get from Google have a name... but that would just derail the thread into politics.

          • sillyfluke 14 hours ago

            There is a truth in the grandparent's comment that doesn't necessarily conflict with this view. The Google 2.0 effect is not necessarily that it gives you a better correct answer faster than google. I think it never dawned on people how bad they were at searching about topics they didn't know much about or how bad google was at pointing them in the right direction prior to chatgpt. Or putting it another way, they never realized how much utility they would get out of something that pointed them in the correct direction even though they couldn't trust the details.

            It turns out that going from not knowing what you don't know to knowing what you don't know adds an order of magnitude improvement to people's experience.

          • TZubiri 11 hours ago

            And the LLM by design does not save or provide sources, unlike Google or Wikipedia, which are transparent about sources.

            • CamperBob2 9 hours ago

              It most certainly does, if you are using the latest models, which people making comments like this never are as a rule.

          • LPisGood 14 hours ago

            There is something to be said for trusting people’s (or systems of people’s) authority.

            For example, have you ever personally verified that humans went to the moon? Have you ever done the experiments to prove the Earth is round?

            • sillyfluke 14 hours ago

              This is not a helpful phrasing I think. Sources allow the reader to go as far down the rabbit hole as they are willing to or knowledgable enough to go.

              For example, if I'm looking for some medical finding and I get to a source that's a clinical study from a reputable publication, I may be satisfied and stop there since this is not my area of expertise. However, a person with knowledge of the field may be able to parse the study and pick it apart better than I could. Hence, their search would not end there since they would be unsatisfied with just the source I was satisfied with.

              On the other hand, having no verifiable sources should leave everyone unsatisfied.

              • LPisGood 13 hours ago

                Of course, that verifiability is a big part of that trust. I’m not sure why you think my phrasing is not helpful; we seem to agree.

            • rwiggins 12 hours ago

              > Have you ever done the experiments to prove the Earth is round?

              I have, actually! Thanks, astronomy class!

              I've even estimated the earth's diameter, and I was only like 30% off (iirc). Pretty good for the simplistic method and rough measurements we used.

              Sometimes authorities are actually authoritative, though, particularly for technical, factual material. If I'm reading a published release date for a video game, directly from the publisher -- what is there to contest? Meanwhile, ask an LLM and you may have... mixed results, even if the date is within its knowledge cutoff.

            • Spooky23 12 hours ago

              Have you provided documentation that you are human? Perhaps you are a lizard person sowing misinformation to firm up dominance of humankind.

    • bee_rider 14 hours ago

      LLMs could make some nice little tools.

      However they’ll need to replace vast swathes of the economy to justify these AI companies’ market caps.

    • blharr 14 hours ago

      Giving ChatGPT stupid AI image generation was a huge nerf. I get frustrated with this all the time.

      • SketchySeaBeast 13 hours ago

        Oh, I think it's great they did that. It's super helpful for visualizing ChatGPT's limitations. Ask it for an absolutely full, overflowing glass of wine or a wrist watch whose time is 6:30 and it's obvious what it actually does. It's educational.

    • tiahura 12 hours ago

      I asked Claude to give me a script in Python to create a map highlighting all Spanish-speaking countries. It took 3 tries and then gave me a perfect SVG and PNG.
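
      Roughly what that kind of script looks like (a sketch, not Claude's exact output: it assumes geopandas < 1.0 with the bundled naturalearth_lowres dataset, and the country list is trimmed):

        import geopandas as gpd
        import matplotlib.pyplot as plt

        # Trimmed list; names must match the "name" column in naturalearth_lowres.
        SPANISH_SPEAKING = {
            "Spain", "Mexico", "Colombia", "Argentina", "Peru", "Venezuela", "Chile",
            "Ecuador", "Guatemala", "Cuba", "Bolivia", "Dominican Rep.", "Honduras",
            "Paraguay", "El Salvador", "Nicaragua", "Costa Rica", "Panama", "Uruguay",
        }

        # gpd.datasets was removed in geopandas 1.0; this assumes an older release.
        world = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))
        world["highlight"] = world["name"].isin(SPANISH_SPEAKING)

        ax = world.plot(color="lightgrey", figsize=(14, 8))
        world[world["highlight"]].plot(ax=ax, color="tab:red")
        ax.set_axis_off()
        plt.savefig("spanish_speaking.png", dpi=150, bbox_inches="tight")
        plt.savefig("spanish_speaking.svg", bbox_inches="tight")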

    • dismalaf 9 hours ago

      > Just gotta do the grunt work, add a tool with a map api. Integrate with google maps for transit stuff.

      This is kind of the crux though. The only way to make LLMs more useful is to basically make them traditional AI. So it's not really a leap forward, never mind a path to AGI.

  • hooverd 16 hours ago

    They should have called it "ChatGPT Enterprise".

    • tsunego 16 hours ago

      Exactly! designed specifically for people who love burning corporate budgets.

      • numba888 11 hours ago

        OpenAI is going to add it to Plus subscriptions, i.e. available to many at no additional cost. Likely with restrictions like N prompts/hour.

        As for the API price: when it matters, businesses and people are willing to pay much more for just slightly better results. OpenAI doesn't take the other options away, so we don't lose anything.

    • fodkodrasz 16 hours ago

      IMO the 4o output is a lot more Enterprise-compatible; 4.5 being straight to the point and more natural is quite the opposite. Pricing-wise your point stands.

      Disclaimer: have not tried 4.5 yet, just skimmed through the announcement, using 4o regularly.

  • Kerbonut 16 hours ago

    Apparently, OpenAI API "credits" expire after a year. I stupidly put in another $20 and am now trying to blow through it; 4.5 is the easiest way, considering 4o has recently fallen out of favor versus other models, and I don't want to just let the credits expire again. An expiry after only one year is asinine.

    • Chance-Device 14 hours ago

      Yes. I also discovered this, and was also forced to blow through my credits in a rush. Terrible policy.

      • glenstein 13 hours ago

        I'm learning this for the first time now. I don't appreciate having to anticipate how many credits I'll use like it's an FSA account.

      • heed 11 hours ago

        >Terrible policy.

        And unfortunately one not exclusive to OpenAI. Anthropic credits also expire after 1 year.

        • gloosx 43 minutes ago

          Not sure how it's with OpenAI, but Anthropic is so money-hungry, they won't even let you remove your debit card data from your account without a week-long support encounter.

  • jstummbillig 15 hours ago

    This is how pricing on human labour works. Nobody expects an employee that costs twice as much to produce twice the output for any given task. All that is expected is that they can do a narrow set of things that another person can't.

neom 15 hours ago

4.5 can extremely quickly distill and work with what I, at least, consider complex, nuanced thought. 4.5 is night and day better than every other AI for my work; it's quite clever and I like it.

Very quick mvp comparison for the show me what you mean crew: https://chatgpt.com/share/67c48fcc-db24-800f-865b-c0485efd7f... & https://chatgpt.com/share/67c48fe2-0830-800f-a370-7a18586e8b... (~30 seconds vs ~3 minutes)

  • nyrikki 13 hours ago

    The 4.5 has better 'vibes' but isn't 'better', as a concrete example:

    > Mission is the operationalized version of vision; it translates aspiration into clear, achievable action.

    The "Mission is the operationalized version of vision" is not in the corpus that I am find and is obviously a confabulated mixture of classic Taylorist like "strategic planning"

    SOPs and metrics, which will be tied to compensation and the unfortunate ubiquitous nature of Taylorism would not result in shared purpose, but a bunch of Gantt charts past the planning horizon.

    IMHO I would consider "complex nuanced thought" to mean understanding the historical issues and at least respecting the divide between classical and neo-classical org theory, or at least avoiding polluting more modern theories with classical baggage that is a significant barrier to delivering value.

    Mission statements need to share strategic intent in an actionable way; strategy is not operationalization.

    • ewoodrich 10 hours ago

      I have been experimenting with 4.5 for a journaling app I am developing for my own personal needs, for example, turning bullet/unstructured thoughts into a consistent diary format/voice.

      The quality of writing can be much better than Claude 3.5/3.7 at times, but it struggles with similar confabulation of information that is not in the original text but "sounds good/flows well". Which isn't ideal for a personal journal... I am still playing around with the system prompt, but given the astronomical cost (even with me as the only user) and the marginal benefits, I am probably going to end up sticking with Claude for now.
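
      For concreteness, the setup is roughly the sketch below; the model ID, prompt wording, and sample bullets are illustrative placeholders rather than the app's actual code:

        from openai import OpenAI

        client = OpenAI()
        resp = client.chat.completions.create(
            model="gpt-4.5-preview",  # placeholder; swap in whichever model is being compared
            messages=[
                {
                    "role": "system",
                    "content": (
                        "Rewrite the user's bullets as a first-person diary entry in a "
                        "consistent, natural voice. Use only facts present in the input; "
                        "do not add details, however plausible they sound."
                    ),
                },
                {"role": "user", "content": "- slept badly\n- fixed the sync bug\n- long walk at dusk"},
            ],
        )
        print(resp.choices[0].message.content)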

      Unless others have a recommendation for a less robot-y sounding model (that will, however, follow instructions precisely) with API access other than the mainstream Claude/OpenAI/Gemini models?

      • neom 10 hours ago

        I've found this on par with 4.5 in tone, but not as nuanced in connecting super wide ideas in systems; 4.5 still does that best: https://ai.google.dev/gemini-api/docs/thinking

        (also: the person you are responding to is doing exactly what you're saying you don't want done: taking something unrelated to the original text (Taylorism) that could sound good, and jamming it in)

    • neom 13 hours ago

      The statement "Mission is the operationalized version of vision; it translates aspiration into clear, achievable action" isn't a Taylorist reduction of mission to mechanical processes - it's actually a nuanced understanding of how these organizational elements relate. You're misinterpreting what "operationalized" means in this context. From what i can tell, the 4.5 response isn't suggesting Taylorist implementation with Gantt charts etc it's describing how missions translate vision into actionable direction while remaining strategic. Instead of jargon, it's recognizing that founders need something between abstract vision and tactical execution. Missions serve this critical bridging function. CEO has vision, orgs capture the vision into their missions, people find their purpose when aligned via the 2. Without it, founders either get stuck in aspirational thinking or jump straight to implementation details without strategic guidance. The distinction matters exactly because it helps avoid the dysfunction that prevents startups from scaling effectively. I think you're assuming "operationalized" means tactical implementation (Gantt charts, SOPs) when in this context it means "made operational/actionable at a strategic level". Missions != mission statements. Also, you're creating a false dichotomy between "strategic intent" and "operationalization" when they very much, exist on a spectrum. (If anything, connecting employees to mission and purpose is the opposite of Tayloristic thinking, which viewed workers more as interchangeable parts than as stakeholders in a shared mission towards responding to a shared vision of global change) - You are doing what o1 pro did, and as I said: As a tool for teaching business to founders, personally, I find the 4.5 response to be better.

      • nyrikki 12 hours ago

        An example of a typical naive definition of a mission statement is:

        Concise, clear, and memorable statement that outlines a company's core purpose, values, and target audience.

        > "made operational/actionable at a strategic level".

        Taking the common definition from the first part of this plan, what do you think the average manager would do, given that in the social sciences, operationalization is explicitly about measuring abstract qualities? [1]

        "operationalization" is a compromise, trying to quantify qualitative properties, it is not typically subject to methods like MECE principal, because there are too many unknown unknowns.

        You are correct that "operationalization" and "strategic intent" are not mutually exclusive in all aspects, but they are for mission statements that need to be durable across changes that no CEO can envision.

        The "made operational/actionable at a strategic level" is the exact claim of pseudo scientific management theory (Greater Taylorism) that Japan directly targeted to destroy the US manufacturing sector. You can look at the former CEO of Komatsu if you want direct evidence.

        GM's failure to learn from Toyota at NUMII (sp?) is another.

        The planning process needs to be informed by strategy, but planning is not strategic; it has a limited horizon.

        But you are correct that it is more nuanced and neither Taylor nor Tolstoy allowed for that.

        Neo-classical org theory is when bounded rationality was first acknowledged, although the Prussian military figured that out long before Taylor grabbed his stopwatch to time people loading pig iron into train cars.

        I encourage you to read:

        Strategy: A History by Sir Lawrence Freedman

        for a more in-depth discussion.

        [1] https://socialsci.libretexts.org/Bookshelves/Sociology/Intro...

        • neom 12 hours ago

          Your responses are interesting because they drive me to feel reinforced about my opinion. This conversation is precisely why I rate 4.5 over o1 pro. I prompted in a very, very, very specific way. I'm afraid to say your comments are highly disengaged from the realities of business and business building. Appreciate the historical context and recommended reading (although I assure you, I am extremely well versed).

          The term 'operationalized' here refers to strategic alignment, not Taylorist quantification - think guiding principles over rigid metrics. You are badly conflating operationalization in the social sciences (which is about measurement) with strategic operationalization in management, which is not the same. Again: operationalized in this context means making the mission actionable at a strategic level, not quantification. Modern mission frameworks prioritize adaptability within durable purpose, avoiding the pitfalls you've rightly flagged.

          Successful founders don't get caught in these theoretical distinctions. Founders taught by me, and I guess by GPT-4.5, correctly understand mission as the bridge between aspirational vision and practical action. This isn't "Greater Taylorism" but pragmatic leadership. While your historical references (NUMMI, not NUMII) demonstrate academic knowledge, they miss how effective missions actually guide organizations while remaining adaptable. The 4.5 response captured this practical reality well - it pointed at the connection without creating artificial boundaries between interconnected concepts. If we had some founders trained by you (o1 Pro) and me (GPT-4.5), I would be willing to bet my founders would outperform yours any day of the week.

          • nyrikki 10 hours ago

            Tuckman as a 'real' framework is a belief so that is fair.

            He clearly communicated in 1977 that his ideas were never formally validated and that he cautioned about their use in other contexts.

            I think the concepts can be useful, so long as you don't take them as anything more than a guiding framework that may or may not be appropriate for a particular need.

            https://core.ac.uk/download/pdf/36725856.pdf

            I personally find value in team and org mission statements, especially for building a shared purpose, but to be honest, the studies on that are more about manager satisfaction than anything else.

            There is far more data on the failure of strategy execution, and linking strategy with purpose as well as providing runways and goals is one place I find vision and mission statements useful.

            As up to 90% of companies fail on strategy execution, and because employee engagement is in free fall, the fact that companies are still in business means little.

            Context is king, and this is horses for courses, but I would caution against ignoring more recent, Nobel winning theories like Holmström's theorem.

            Most teams don't experience the literal steps Tuckman suggested, rarely all at once, and never as one-time singular events. As the above link demonstrated, some portions, like storming, can be problematic.

            Make them operationalize their mission statement and they will, and it will be set in concrete.

            Remember von Moltke: "No plan of operations extends with certainty beyond the first encounter with the enemy's main strength."

            There is a balance between C2 and mission-command styles; the risk is forcing, or worse, intentionally causing people to resort to C2 when you almost always need a shifting balance between command- and intent-based solutions.

            The Feudal Mode of Production was sufficient for centuries, but far from optimal.

            The NUMMI reference was exactly related to the same reason Amazon's profits historically rose more than its headcount increases alone should have allowed.

            Small cross functional teams, with clearly communicated tasks, and enough freedom to accomplish those tasks efficiently.

            You can look at Trist's study about the challenges of incentives leading teams to game the system. The same problem happened under Ballmer at MS, and DEC failed the opposite way, trying to do everything at once and please everyone.

            https://www.uv.es/=gonzalev/PSI%20ORG%2006-07/ARTICULOS%20RR...

            The reality is that the popularity of frameworks rarely relates to their effectiveness, building teams is hard, making teams work as teams across teams is even harder.

            Tuckman may be useful in that... but this claim is wrong:

            > "Modern mission frameworks prioritize adaptability within durable purpose, avoiding the pitfalls you’ve rightly flagged"

            Modern _ frameworks prioritize adoption, and depending on a framework to solve your company's needs will always fail. You need to choose a framework that fits your strategy and objectives, and adapt it to fit your needs.

            Learn from others, but don't ignore the reality on the ground.

            • neom 9 hours ago

              Regarding Tuckman's model, there are actually numerous studies validating its relevance and practical application: Gren et al. (2017) validated it specifically for agile teams across eight large companies. Natvig & Stark (2016) confirmed its accuracy in team development contexts. Bonebright's (2010) historical review demonstrated its ongoing relevance across four decades of application.

              I feel we're talking past each other here. My original point was about which AI model is better for MY WORK. (I run a startup accelerator for first-time founders.) 4.5, in 30 seconds rather than minutes, provided more practical value to founders building actual businesses, and saved me time. While I appreciate your historical references and academic perspectives, they don't address my central argument about GPT-4.5's response being more pragmatically useful.

              The distinction between academic precision and practical utility is exactly what I'm highlighting. Founders don't need perfect theoretical models - they need frameworks that help them bridge vision and execution in the real world. When you bring up feudal production modes and von Moltke, we're moving further from the practical question of which AI response would better guide someone trying to align teams around a meaningful mission that drives business results. It's exactly why I formed the 2 prompts in the manner I did: I wanted to see if it was an academic or an expert.

              My assessment stands: GPT-4.5's 30 seconds of thinking about how mission operationalizes vision reflects how successful businesses actually work, not how academics might describe them in theoretical papers. I've read the papers, I've studied the theory deeply, but I also have NYSE and NASDAQ ticker symbols under my belt, from seed. That is the whole point here.

              • nyrikki 7 hours ago

                OK, maybe we are using different meanings of the word "operationalize".

                If I were, say, in middle management and you asked me to "operationalize" the impact of mission statements, I would try to associate the existence of a mission statement on a team with some metric like financial performance.

                If I were on a small development team and you asked me to "operationalize" our mission statement, I would probably make the same mistake the software industry always does, like tying it to tickets closed, lines of code, or even the DORA metrics.

                Under my understanding of "operationalize", and the only way I can find it referenced in relation to mission statements themselves, I would actually de-emphasize deliverables, quality, stakeholders' changing demands, etc...

                Even if I try to "operationalize" in a more abstract way, like defining an impact score, it may not directly map to business objectives or even team building.

                Almost every LLM offers a definition similar to the one I offered above, e.g.:

                > "operationalization" refers to the process of defining an abstract concept in a way that allows it to be measured and observed through specific, concrete indicators

                Impact scores, which are subjective, can lead to Google's shelfware problems, and even scrum rituals often lead to hard but high-value tasks being ignored because the incentives don't allow for them.

                In both of your cites, they were situations where existing cultures were enhanced, not fully replaced.

                Both were also short term, and wouldn't capture the long tail problems I am referencing.

                Heck, even Taylorism worked well for the auto industry until outside competition killed it. Well, at least for the companies; consumers suffered.

                The point is that "operationalization" specifically is counterproductive under a model, where infighting during that phase is bad.

                If you care about delivering on execution, it would seem to be important to you. But I realize that you may not be targeting repeat work... I just don't know.

                But I am sure some McKinsey BA has probably put that concern in a PDF someplace by now, because the GoA Agile assessment guide is being incorporated and even ITIL and TOGAF reference that coal face paper I cited.

                The BCGs and McKinseys of the world are absolutely going to shift to detection of common confabulations to show value.

                While I do take any tools possible to make me more productive, correctness of content concerns me more than exact verbiage.

                But yes, different needs; I am in the niche of rescuing failed initiatives, which admittedly is far from the typical engagement style.

                To be honest the lack of scratch space on 4.5 compared to CoT models is the main blocker for me.

  • ttul 14 hours ago

    I believe 4.5 is a very large and rich model. The price is high because it's costly to inference; however, the bigger reason is to ensure that others don't distill from it. Big models have a rich latent space, but it takes time to squeeze the juice out.

    • esafak 14 hours ago

      That also means people won't use it. Way to shoot yourself in the foot.

      The irony of a company that has distilled the world's information complaining about another company distilling their model...

      • ttul 12 hours ago

        The small number of use cases that do pay are providing gross margins as well as feedback that helps OpenAI in various ways. I don’t think it’s a stupid move at all.

      • cscurmudgeon 14 hours ago

        My assumption: There will be use cases where cost of using this will be smaller than the gain from it. Data from this will make the next version better and cheaper.

phillipcarter 15 hours ago

My take from using it a bit is that they seem to have genuinely innovated on:

- Not writing things that go off in weird directions / staying grounded in "reality"

- Responding very well to tone preferences and catching nuance in what I say

It seems like it's less that it has a great "personality" like Claude, but that it's capable of adapting towards being the "personality" I want and "understanding" what I'm saying in ways that other models haven't been able to do for me.

  • XenophileJKO 13 hours ago

    So this kind of mirrors my feelings after using GPT-4.5 on general conversation and song writing.

    GPT picked up on unspecified requirements almost instantly. It is subtle (and may be undesirable in some contexts). For example, in my songs I have to bracket the section headings; it picked up on that from my original input. All the other frontier models generally have to be reminded. Additionally, I separately asked for an edit to a music style description. When I asked GPT-4.5 to write a song all by itself, it included a music style description. No other model I have worked with has done this.

    These are subtle differences, but in aggregate the model just generally needs less nudging to create what is required.

    • torginus 9 hours ago

      I haven't used 4.5 but have some experience using Claude for creative writing, and in my experience it sometimes has the uncanny ability to get to the core of my ideas, rephrasing my paragraph long descriptions into just a sentence or two, or both improving and concretizing my vague ideas into something that's insightful and tasteful.

      Other times it locks itself into a dull style and ignores what I ask of it and just produces boring generic garbage, and I have to wrangle it hard to get some of the spark back.

      I have no idea what's going on inside, but just like with Stable Diffusion, it's fairly easy to make something that has the spark of genius, and is very close to being perfect, but getting the last 10% there, and maintaining the quality seems almost impossible.

      It's a very weird feeling, it's hard to put into words what exactly is going on, and probably even harder to make it into a benchmark, but it makes me constantly flip-flop between being scared of how good the AI is, and questioning why I ever bothered with using it in the first place, as I would've progressed much faster without it.

pzo 16 hours ago

Long term it might be hard to monetise this infrastructure considering their competition:

1) For coding (API), most will probably stick to Claude 3.5 / 3.7 - a big market but still small compared to all worldwide problems

2) For non-coding API use, IMHO Gemini 2.0 Flash is the winner - dirt cheap (cheaper than 4o-mini), good enough and even better than gpt-4o, with cheap audio and image input.

3) For the subscription app, ChatGPT is probably still the best, but only slightly - they have the best advanced voice conversation, but Grok will probably be eating their lunch here

  • anukin 15 hours ago

    The Sesame voice model is IMO better than ChatGPT's voice conversation. They are going to open source it as well.

    • bckr 15 hours ago

      Sure but is there an app I can talk to / work with? It seems they're a voice synthesis model company, not a chatbot app / tool company.

    • OsrsNeedsf2P 14 hours ago

      > They are going to open source it as well.

      Means nothing until they do

  • Layvier 10 hours ago

    We were using gpt-4o for our chat agent, and after some experiments I think we'll move to Flash 2.0. Faster, cheaper, and even a bit more reliable. I also experimented with the experimental thinking version, and there a single-node architecture seemed to work well enough (instead of multiple specialised sub-agent nodes). It did better than DeepSeek actually. Now I'm waiting for the official release before spending more time on it.

  • ipaddr 15 hours ago

    For the rest of us using free tiers, ChatGPT is hands down the winner, allowing limited image generation, unlimited usage of some models, and limited usage of 4o.

    Claude is still stuck at 10 messages per day and gemini is less accurate/useful.

    • dingnuts 15 hours ago

      10 messages a day? How are people "vibe coding" with that?

      • irishloop 15 hours ago

        They're paying for Pro

        • dingnuts 15 hours ago

          Ah thank you; I had heard the paid ones had daily limits too so I was confused

          • danielbln 14 hours ago

            They do, I subscribe to pro. All of my vibe coding however is done via the API.

siva7 16 hours ago

It's marketed as being slightly better at "creative writing". This isn't the problem most businesses have with current-generation LLMs. On the other side, Anthropic released a new model at nearly the same time which solves more practical problems for businesses, to the point that many insiders no longer use OpenAI models for coding tasks.

  • dingnuts 15 hours ago

    I think it should be illegal to trick humans into reading "creative" machine output.

    It strikes me as a form of fraud that steals my most precious resources: time and attention. I read creative writing to feel a human connection to the author. If the author is a machine and this is not disclosed, that's a huge lie.

    It should be required that publishers label AI generated content.

    • Hoasi 11 hours ago

      > I think it should be illegal to trick humans into reading "creative" machine output.

      Creativity has lost its meaning. Should it be illegal? The courts will take a long time to settle the matter. Reselling people's work against their will as creative machine output seems unethical, to say the least.

      > It should be required that publishers label AI-generated content.

      Strongly agree.

    • CuriouslyC 14 hours ago

      I'm pretty sure you read for pleasure, and feeling a human connection is one way that you derive pleasure. If it's the only way that you derive pleasure from reading, my condolences.

      • becquerel 14 hours ago

        Pretty much where my thoughts on this are. I rarely feel any particular sense of connection to the actual author when I read their books. And I have taken great pleasure from some AI stories (to the degree I put them up on my personal website as a way to keep them around).

    • Philpax 14 hours ago

      Under the dingnuts regime, Dwarf Fortress will be illegal. Actually, any game with a procedural story? You better believe that's a crime: we can't have a machine generate text a human might enjoy.

      • glenneroo 11 hours ago

        Dingnuts' point was that it should be disclosed. Everyone knows Dwarf Fortress stories are procedural/AI-generated; the authors aren't trying to hide that fact.

        • Philpax 11 hours ago

          Actually, fair enough. I still disagree with their argument, but this was the wrong tack for me to use.

kubb 16 hours ago

Seems like we're hitting the limits of the technology...

  • gloosx 24 minutes ago

    Consumer limits. When something is good enough that it can make stable money, there is no real incentive to innovate beyond the bare minimum—just enough to keep consumers engaged, shareholders satisfied, and regulators at bay.

    This is how modern capitalism and all corporations work, we will keep receiving new numbers in versions without any sensible change – consumers will keep updating their subscriptions out of habit – xyzAI PR managers, HR managers, corporate lawyers and myriads of other bureaucrats will keep receiving their paychecks secretly dreaming of retirement, xyzAI top management will burn money on countless acquisitions just to fight boredom, turning into xyz(ai)MEGAcorp doing everything from crude oil processing and styrofoam cups to web services and AI models.

    No modern mega corporation is capable of making something else or different from what has already worked for them even once. We could have achieved universal welfare and prosperity 60 years ago, but that would've disrupted the cycle. Instead, we got planned obsolescence, endless subscription models, and a world where everything "new" is just a slightly repackaged version of last year's product.

  • xmichael909 15 hours ago

    Yes, I believe the sprint is over; now it's going to be slow cycles, maybe 18 months to see a 5% increase in ability, and even that 5% increase will be highly subjective. Claude's new release is about the same: 3.7 is arguably worse at some things than 3.5 and better at others. Based on the previous pace of releases, in about 6 months or so - if the next release from any of the leaders is about the same "kinda better, kinda worse" - then we'll know. Imagine how much money is going to evaporate from the stock market if this is the limit!!!

    • ANewFormation 14 hours ago

      You can keep getting rich off shovels long after the gold has run dry.

    • borgdefenser 5 hours ago

      To say 3.7 is worse is completely insane.

  • TIPSIO 15 hours ago

    I also hate waiting on reasoning.

    I would much prefer a super lightning-fast model that is cheaper but the same quality as these frontier models.

    Let me query these things to death.

  • apwell23 14 hours ago

    Does this mean we get a reprieve from "this is just the beginning" comments?

    • thfuran 11 hours ago

      Maybe if it takes many years before the next major architectural advancement.

    • kubb 14 hours ago

      I wouldn't count on it.

ghostly_s 14 hours ago

I don't get it. Aren't these two sentences in the same paragraph contradictory?

>"Scaling to this size of model did NOT make a clear jump in capabilities we are measuring."

> "The jump from GPT-4o (where we are now) to GPT-4.5 made the models go from great to really great."

  • XenophileJKO 13 hours ago

    No, it means that it got better on things orthogonal to what we have mostly been measuring. On the last few rounds, we have been mostly focusing on reasoning, not as much on knowledge, "creativity", or emotional resonance.

    • johnecheck 9 hours ago

      "It's better. We can't measure it, but we're pretty sure it's better. We also desperately need it to be better because we just spent a boat-load of money on it."

mirekrusin 12 hours ago

Is somebody actually looking at those last percentages on benchmarks?

Aren't we making the mistake of assuming benchmarks are purely 100% correct?

sunami-ai 14 hours ago

Meanwhile, all GPT-4o models on Azure are set to be deprecated in May and there are no alternative models yet. Should we start moving to Anthropic? DeepSeek is too slow, melting under its own success. Anyone on GPT-4o/Azure have any idea when they'll release the next "o" model?

  • Uvix 14 hours ago

    Only an older version of GPT-4o has been deprecated and will be removed in May. The newest version will be supported through at least 20 November 2025.

    https://learn.microsoft.com/en-us/azure/ai-services/openai/c...

    • sunami-ai 11 hours ago

      The Nov 2024 release, which is due to be deprecated in Nov 2025, I was told has degraded performance compared to the Aug 2024 release. In fact, OpenAI Models page says their current GPT4o API is serving the Aug release. https://platform.openai.com/docs/models#gpt-4o

      So I'm still on the Aug 24 release, which, with your reminding me, is not to be deprecated till Aug 2025, but that's less than 5 months from now, and we're skipping the Nov 2024 release just as OpenAI themselves have chosen to do.

EcommerceFlow 15 hours ago

I've found 4.5 to be quite good at "business decisions", much better than other models. It does have some magic to it, similar to Grok 3, but maybe a bit smarter?

yimby2001 16 hours ago

It seems like there's a misunderstanding as to why this happened. They've been baking this model for months, long before DeepSeek came out with fundamentally new ways of distilling models. And even given that it's not great in its large form, they're going to distill from this going forward... so it likely makes sense for them to periodically train these very large models as a basis.

  • lhl 15 hours ago

    I think this framing isn't quite right either. DeepSeek's R1 isn't very different from what OpenAI has already been doing with o1 (and that other groups have been doing as well). As for distilling - the R1 "distilled" models they released aren't even proper (logit) distillations, but just SFTs, not fundamentally new at all. But it's great that they published their full recipes and it's also great to see that it's effective. In fact we've seen now with LIMO, s1/s1.1, that even as few as 1K reasoning traces can get most LLMs to near SOTA math benchmarks. This mirrors the "Alpaca" moment in a lot of ways (and you could even directly mirror say LIMO w/ LIMA).
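
    To make the distinction concrete, here's a toy sketch of the two objectives (tensor shapes and temperature are illustrative): proper distillation needs the teacher's logits, while the released R1 "distillations" just do cross-entropy on teacher-generated tokens.

      import torch.nn.functional as F

      def logit_distill_loss(student_logits, teacher_logits, T=2.0):
          # KL between softened teacher and student distributions; requires teacher
          # logits, which hosted APIs generally do not expose.
          log_p_student = F.log_softmax(student_logits / T, dim=-1)
          log_p_teacher = F.log_softmax(teacher_logits / T, dim=-1)
          return F.kl_div(log_p_student, log_p_teacher, log_target=True,
                          reduction="batchmean") * (T * T)

      def sft_loss(student_logits, teacher_token_ids):
          # Plain cross-entropy on tokens the teacher generated - sampled text is enough.
          return F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                                 teacher_token_ids.view(-1))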

    I think the main takeaway of GPT-4.5 (Orion) is that it basically gives a perspective on all the "hit a wall" talk from the end of last year. Here we have a model that has been trained on, by many accounts, 10-100X the compute of GPT-4, and is likely several times larger in parameter count, but is only... subtly better, certainly not super-intelligent. I've been playing around with it a lot the past few days, both with several million tokens' worth of non-standard benchmarks and by talking to it, and it is better than previous GPTs (in particular, it makes a big jump in humor), but I think it's clear that the "easy" gains in the near future are going to come from figuring out how as many domains as possible can be approximately verified/RL'd.

    As for the release? I suppose they could just have kept it internally for distillation/knowledge transfer, so I'm actually happy that they released it, even if it ends up not being a really "useful" model.

mvkel 14 hours ago

I think this release is for the researchers who worked on it and would quit if it never saw daylight

sampton 16 hours ago

Too much money not enough new ideas.

modeless 16 hours ago

I've been using 4.5 instead of 4o for quick questions. I don't mind the slowness for short answers. I feel like it is less likely to hallucinate than other models.

ein0p 14 hours ago

I have access to it. It is better, but not where most techies would care. It knows more, it writes better, it's more pleasant to talk to. I think they might have studied the traffic their hundreds of millions of users generate and realized where they need to improve, then did exactly that for their _non thinking_ model. They understand that a non-thinking model is not going to blow the doors off on coding no matter what they do, but it can do writing and "associative memory" tasks quite well, and having a lot more weights helps there. I also predict that they will fine tune their future distilled, thinking models for coding, based on the same logic, distilling from 4.5 this time. Those models have to be fast, and therefore they have to be smaller.

retskrad 16 hours ago

Sam Altman views Steve Jobs as one of his inspirations (he called the iPhone the greatest product of all time). So if you look at OpenAI through the lens of Apple, where you think about making the product enjoyable to use at all costs, then it makes perfect sense why you'd spend so much money to go from 4o to 4.5, which brings such subtle differences to power users.

The vast majority of users, which are over 300 million weekly, will mainly use 4o and whatever is the default. In the future they’ll use 4.5 and think it’s most human like and less robotic.

  • bravura 15 hours ago

    Yes but Steve Jobs also understood the paradox of choice, and the importance of having incredibly clear delineation between every different product in your line.

    • ipaddr 15 hours ago

      Do models matter to the regular user over brand? People talk about using ChatGPT over Google's AI or DeepSeek, not 4o-mini vs Gemini 2.

      OpenAI has done a good job of making the model less important and the domain chatgpt.com more important.

      Most of the time the model rarely matters. When you find something incorrect you may switch models but that rarely fixes the problem. Rewording a prompt has more value than changing a model.

      • esafak 14 hours ago

        If the model did not matter they would be spending their money on marketing or sales instead of improving the model.

        • ipaddr 8 hours ago

          Spending or saying they are spending is marketing but when people use their product the model doesn't matter.

buyucu 16 hours ago

OpenAI has been irrelevant for a while now. All of the new and exciting developments on AI are coming from other places. ClosedAI is no longer the driver of change and innovation.

  • jug 15 hours ago

    I think OpenAI is currently in this position where they are still the industry standard, but also not leading. DeepSeek R1 beat o1 on perf/cost, with similar performance at a fraction of the cost. o3-mini is judged as "weird" and quite hit-and-miss on coding (basically the sole reason for its existence), with a sky-high SimpleQA hallucination rate due to its limited scope, and is probably beaten by Sonnet 3.7 by a fairly large margin.

    Still, being early with a product that's often "good enough" still takes them a long way. I think GPT-5, and where their competition will be then, will be quite important for OpenAI though. I think the signs on the horizon are that everyone will close in on each other as we hit diminishing returns, so the underlying business model, integrations, enterprise reach, marketing and market share will probably be king rather than the underlying LLM in 2026.

    Since GPT-5 is meant to select the best model behind the scenes, one issue might be that users won’t have the same confidence in the model, feeling like it’s deciding for them or OpenAI tuning it to err on the side of being cheap.

  • mrcwinn 16 hours ago

    That’s quite a world you’ve constructed!

  • nickthegreek 16 hours ago

    The other models are literally distilling OpenAI’s models into theirs.

    • demosthanos 16 hours ago

      So it's been claimed, but has it been proven yet?

      I'm not even sure what is being alleged there—o1's reasoning tokens are kept secret precisely to avoid the kind of distillation that's being alleged. How can you distill a reasoning process given only the final output?

      • ipaddr 15 hours ago

        DeepSeek outputting that it is ChatGPT is a big clue.

    • orbital-decay 16 hours ago

      Do they? Why doesn't this happen to Claude then? I've been hearing this for a while, but never saw any evidence beyond the contamination of the dataset with GPT slop that is all over the web. Just by sending anything to the competitors you're giving up a large part of your know how before you even finish your product, that's a big incentive against doing that.

      • nickthegreek 16 hours ago

        Who said it isn’t happening to Claude?

        Companies are 100% using these big players to generate synthetic data. Distillation is extremely powerful. How is this even in question?

        • nullc 13 hours ago

          OpenAI conceals probabilities so how is anyone distilling from it?

    • kossTKR 16 hours ago

      And OpenAI based their tech on a Google paper, itself building on years of public academic research, so what's the point exactly here?

      OpenAI was just first out of the gates; there'll always be some company that's first. The essence is how they handle their leadership, and they've sadly been absolutely terrible and scummy.

      Actually I think Google was a pretty good example of the exact opposite, decades of "actually not being evil", while OpenAI switched up 1 second after launch.

      • ipaddr 15 hours ago

        Google wasn't the first search engine, but they were the best at marketing google = search. That's where we are with OpenAI. Google Search was a better product at the time, and ChatGPT 3.5 was a breakthrough the public used. Fast forward and some will say Google isn't the best search engine anymore (Kagi, DuckDuckGo, Yandex offer different experiences), but people still think of google = search. Same with ChatGPT. Claude may be better for coding, or Gemini better at searching, or DeepSeek cheaper but equal, but ChatGPT is a verb and will live on, like Intel Inside, long after its actual value has declined.

        • williamcotton 14 hours ago

          Google was so much better than AltaVista that I just can’t buy that it was marketing that pushed them to the forefront of search.

          • ipaddr 8 hours ago

            Having a good product is marketing, and that parallels ChatGPT.

        • Jweb_Guru 13 hours ago

          > Google wasn't the first search engine but they were the best marketing google = search

          Google's overwhelming victory in search had ~ nothing to do with marketing.

          • ipaddr 8 hours ago

            Ever heard the term "to google something"? Viral marketing is still marketing.

            • kbelder 21 minutes ago

              That happened long after they completely dominated search. They succeeded because of quality, and because of how low quality all the other engines were.

              There was a time when Google was thought of as a respectable, high-quality, smart and nimble company. That has faded as the marketing grew.

      • nickthegreek 16 hours ago

        > so what’s the point exactly here.

        What is your point? OpenAI wasn’t the first out of that gate as your own argument cites Google prior. All these companies are predatory, who is arguing against that? OP said OpenAi was irrelevant. That’s just dumb. They are not. Feel free to advance an argument in favor of that narrative if you wish as I was just trying to provide a single example that shows that some of these lightweight models are building directly off the backs of giants spending the big money. I find nothing wrong with distillation and am excited about companies like DeepSeek.

Artgodma 14 hours ago

No general model is the frontier.

Thousands of small, specific models are infinitely more efficient than a general one.

The narrower the task, the better algorithms work.

That's obvious.

Why are general models pushed so hard by its creators?

Their enormous valuations are based on total control over user experience.

This total control is justified by computational requirements.

Users can't run general models locally.

Giant data centers for billions are the moat for Model creators and corporations behind.

  • azan_ 14 hours ago

    It's neither obvious nor true; generalist models outperform specialized ones all the time (so frequently that it even has its own name: the bitter lesson).

  • maleldil 14 hours ago

    Certain desirable capabilities are available only in bigger models, because it takes a certain size for some behaviours to emerge.