Ask HN: Resources for general purpose GPU development on Apple's M* chips?

149 points by thinking_banana 6 days ago

While Apple M* chips seem to have incredible unified memory access, the available learning resources seem to be quite restricted and often convoluted. Has anyone been able to get past this barrier? I have some familiarity with general purpose software development with CUDA and C++. I want to figure out how to work with and use Apple's developer resources for general purpose programming.

aleinin 6 days ago

If you're looking for a high level introduction to GPU development on Apple silicon I would recommend learning Metal. It's Apple's GPU acceleration language similar to CUDA for Nvidia hardware. I ported a set of puzzles for CUDA called GPU-Puzzles (a collection of exercises designed to teach GPU programming fundamentals)[1] to Metal [2]. I think it's a very accessible introduction to Metal and writing GPU kernels.

[1] https://github.com/srush/GPU-Puzzles

[2] https://github.com/abeleinin/Metal-Puzzles
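
As a rough illustration of what "writing a GPU kernel" means, here is a CPU-side Python sketch of a map-style kernel: one logical thread per element, each guarded by a bounds check. It is not Metal code and is not taken from either repository; `map_kernel`, `dispatch` and the thread count are hypothetical names and values chosen for illustration.

```python
# CPU-side emulation of a GPU "map" kernel (illustration only, not Metal).
# All names here (map_kernel, dispatch) are hypothetical.

def map_kernel(thread_id, a, out):
    """Body run by one logical GPU thread: out[i] = a[i] + 10."""
    # Guard against threads launched past the end of the buffer.
    if thread_id < len(a):
        out[thread_id] = a[thread_id] + 10


def dispatch(kernel, num_threads, *buffers):
    """Run the kernel once per thread id; a GPU would run these in parallel."""
    for thread_id in range(num_threads):
        kernel(thread_id, *buffers)


a = [0, 1, 2, 3, 4]
out = [0] * len(a)
# Launch more threads than elements, as often happens when the grid is
# rounded up to a multiple of the threadgroup size.
dispatch(map_kernel, 8, a, out)
print(out)
```

On a real GPU the loop in `dispatch` disappears: the hardware runs the kernel body for every thread id at once, which is why the bounds check has to live inside the kernel.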

  • dylan604 6 days ago

    After a quick scan through the [2] link, I have added this to the list of things to look into in 2025

    • Jiahang 6 days ago

      Curious about the others in your list

  • singlepaynews 6 days ago

    Can anyone recommend a CUDA equivalent of (2)? That’s a spectacular learning resource and I’d like to use a similar one to upskill for CUDA

    • dagmx 6 days ago

      Isn’t the link right before it exactly what you’re asking for? Since 2 is a port of 1

morphle 6 days ago

You can help with the reverse engineering of Apple Silicon done by a dozen people worldwide, that is how we find out the GPU and NPU instructions[1-4]. There are over 43 trillion float operations per second to unlock at 8 terabits per second of 'unified' memory bandwidth and 270 gigabits per second of networking (less on the smaller chips)....

[1] https://github.com/AsahiLinux/gpu

[2] https://github.com/dougallj/applegpu

[3] https://github.com/antgroup-skyward/ANETools/tree/main/ANEDi...

[4] https://github.com/hollance/neural-engine

You can use high level APIs like MLX, Metal or CoreML to compute other things on the GPU and NPU.

Shadama [5] is an example programming language that translates (with Ometa) matrix calculations into WebGPU or WebGL APIs (I forget which). You can do exactly the same with the MLX, Metal or CoreML APIs and only pay around 3% overhead going through the translation stages.

[5] https://github.com/yoshikiohshima/Shadama

I estimate it will cost around $22K at my hourly rate to completely reverse engineer the latest A16 and M4 CPU (ARMv9), GPU and NPU instruction sets. I think I am halfway on the reverse engineering, the debugging part is the hardest problem. You would however not be able to sell software with it on the App Store as Apple forbids undocumented APIs or bare metal instructions.

  • MuffinFlavored 6 days ago

    This would get rid of needing Metal to be the blackbox and enable things like "nvptx CUDA" equivalent / https://libc.llvm.org/gpu/ right?

    Very interesting. A steal for $22k but I guess very niche for now...

    • morphle 6 days ago

      Yes, knowing the exact CPU and ANE assembly instructions (or the underlying microcode!!) allows for general purpose software to adaptively compile processes on all the core types, not just the CPU ones. It won't always be faster: you get more cache misses (some cores don't have cache) and different DMA and thread scheduling, some registers can't fit the floats or large integers, etc.

      But yes, it will be possible to use all 140 cores of the M2 Ultra or the 36 cores of the M4. There will be an M6 Extreme some day, maybe 500 cores?

      Actually, the GPU and ANE cores themselves are built from teams of smaller cores, maybe a few dozen, hundreds or thousands in all, same as in most NVIDIA chips.

      >A steal for $22k but I guess very niche for now...

      A single iPhone or Mac app (a game, an LLM, pattern recognition, security app, VPN, de/encryption, video en/dec coder) that can be sped up by 80%-200% can afford my faster assembly level API.

      A whole series of hardware level zero-day exploits for iPhone and Mac would become possible, now that won't be very niche at all. It is worth millions to reverse Apple Silicon instruction sets.

      • MuffinFlavored 6 days ago

        What would a "llvm compilable" hello world look like that matches the libc GPU example for "AGX" (Apple Graphics)? It's not possible from MacOS, right? It'd have to be done from Linux?

        • morphle 6 days ago

          No, I don't think it is impossible for MacOS. I might be missing a detail here, not sure. I have to think it over.

          I have seen [1] you can patch ANECompilerService, so you can even speed up existing code, because Apple compiles your code just in time (at runtime) on each machine. We could do that for MacOS libc too.

          [1] Some how-to hints in https://discussions.apple.com/thread/254758525?sortBy=rank

          • MuffinFlavored 6 days ago

            How do you issue/execute "GPU" machine code instructions from MacOS not through Metal?

            • morphle 6 days ago

              You (or your compiler) write the instructions and data into unified memory (up to 192 GB) and jump to the first instruction (usually of a loop) on each core. GPU and ANE processor cores are not fundamentally different from CPU cores, they just have fewer transistors (gates) and therefore more limitations in what a register can address, what data type or what instruction it can execute. Some cores can only execute the same instruction as their neighbor core in a team, but on different data. Or at a different time, synchronized with neighbors. But they still are Turing complete processors so in essence are the same as their cousins the CPU cores. Sometimes a core's input or output addresses are in a pipeline between cores (which limits its address offset).

              MacOS only plays a role in allocating and protecting the instruction or data memory regions for the GPU and ANE processors.

              • MuffinFlavored 5 days ago

                I would like to discuss this more, shot you an email at the one listed here.

  • JackYoustra 6 days ago

    any place you have your current progress written up on? Any methodology I could help contribute on? I've read each one of the four links you've given over the years and it seems vague with how far people have currently gotten and exact issues.

    • morphle 6 days ago

      >Any methodology I could help contribute on?

      Several people have already contacted me today with this request. This is how I give out details and share current progress with you.

      Yes, you can help, most people on HN could. It is not that difficult work and it is not just low level debugging, coding and FPGA hardware. It is also organizing and even simple sales, talking to funders. With patience, you could even get paid to help.

      >any place you have your current progress written up on?

      Not any place in public, because of its value for zero-day exploits. This knowledge is worth millions.

      I'm in the process of rewriting my three scientific papers on reverse engineering Apple Silicon low level instructions.

      >it seems vague with how far people have currently gotten and exact issues.

      Yes, I'm afraid you're right, my apologies. It's very much detailed and technical stuff, some of it under patent and NDA, some even sensitive for winning economic wars and ongoing wars (you can guess those are exciting stories). It even plays a role in the $52.7 billion US, €43 billion EU and $150 billion (unconfirmed) Chinese Chips Acts. Apple Silicon is the main reason TSMC opened a US factory [1], keeping its instruction set details secret is deemed important.

      If you want more information, you should join our offline video discussions. Maybe sometimes sign an NDA for the juicy bits.

      [1] https://www.cnbc.com/2024/12/13/inside-tsmcs-new-chip-fab-wh...

      • saagarjha 6 days ago

        While understanding the GPU’s microarchitecture might be useful for exploits it’s definitely not worth “millions”.

        • morphle 6 days ago

          You are right. The zero-day exploits might be worth roughly a million each, but not the family tree of native GPU, ANE and CPU instruction sets and microarchitecture on which they would be based.

          My apology for writing unclearly, English is not my native language. I'm surprised it is yours.

          Saving on energy, programming effort and purchase cost of a supercomputer in case of M4 instruction sets and microarchitecture knowledge would also save millions.

          • saagarjha 4 days ago

            I doubt that, largely because nobody is really using Apple silicon for supercomputing efforts.

  • dgfitz 6 days ago

    It’s too bad they don’t make this easier on developers, Apple. Is there a reason I don’t see?

    • twoodfin 6 days ago

      Apple wants total freedom to rework lower levels of the stack down to the hardware, without worrying about application compatibility, hence their answer will continue to be Metal.

      • morphle 6 days ago

        I agree that it allows Apple to redefine Apple Silicon instruction sets without having to explain it to 3rd party software developers, but it is certainly not the main reason they hide the technical documentation of the chips.

        • MBCook 6 days ago

          Why not?

          Metal is the answer. Everything else is just implementation detail as GP said.

          Apple doesn’t provide developer support to other OSes. The only OS they do anything for* is macOS. So to them there’s no point.

          All they’d get is people relying on implementation details they shouldn’t, other companies stealing what they consider their trade secrets, or more surface area for patent trolls to scan.

          * Someone on the Asahi team, I think Hector Martin, has commented before that Apple is doing things that clearly seem designed to allow others to make and securely boot other OSes on their Apple Silicon hardware. They clearly could be clamping it down far more but are choosing not to. However that’s exactly as far as the support appears to go.

          • talldayo 6 days ago

            > Metal is the answer. Everything else is just implementation detail as GP said.

            You can say this as long as you want, Nvidia makes money hand-over-fist supporting CUDA alongside OpenCL and DirectX. It's all just business to them - they don't have to play the same game as Apple because they're just not quite so petty with the ecosystem politics.

            Look at MacOS, for example. Plenty of legacy software never was supported in Metal, its "implementation detail" never manifested. It wasn't even really used in AI either until Apple upstreamed their own MPS hacks into PyTorch and people got BERT et al. working, and even that was a pint-sized party trick that you could do on a Raspberry Pi. Apple themselves aren't even using their own servers for serious inference either, because you can't. It's gotta be offloaded to a lower-latency platform.

            It's not just that Metal as a platform has failed its users, although it's certainly contributed to developers giving up on Mac hardware for serious compute. Apple's GPU design is stuck in iPhone mode and they refuse to change their approach with Apple Silicon desktop hardware. It was Apple's big bet on NPUs that hamstrung them, not an implementation detail, and if you don't believe me then wait and see. Xserve didn't tear down the 1U market, Asahi didn't upend Linux HPC, and Metal isn't going to upend AI compute any more than DirectX will. This is the same "Apple will get 'em next year" quote we always hear when they fuck up, and they never actually seem to swallow their pride and take notes.

            • astrange 6 days ago

              Apple are using their own servers for inference, that's the whole private cloud compute thing. Siri and other things use models and probably aren't running on it (though it's not announced), but those are older.

              > Apple's GPU design is stuck in iPhone mode and they refuse to change their approach with Apple Silicon desktop hardware.

              Looks competitive to me.

              https://venturebeat.com/ai/you-can-now-run-the-most-powerful...

              • talldayo 4 days ago

                > Apple are using their own servers for inference, that's the whole private cloud compute thing.

                Not for everything, though. Any ChatGPT/OpenAI-based inference request is being sent to Nvidia GPUs that run models too large for even the biggest Mac servers. You cannot refute this simply because Apple does not sell DGX-like server products. Even the rackmount Apple Silicon is still orders-of-magnitude off on the kind of performance you can get from a 1u GPU rack.

                > Looks competitive to me.

                When compared on equal grounds, Apple doesn't even have a GPU that beats Nvidia's 30XX series on power efficiency: https://browser.geekbench.com/opencl-benchmarks

                If it "looks competitive" to you, then I invite you to look closer than just qualitative evidence. Apple's 3nm desktop designs are losing in straight-shot comparisons with Nvidia's 8nm products.

        • lukeh 6 days ago

          I imagine it’s just efficient allocation of engineering resources.

        • amelius 6 days ago

          > but it is certainly not the main reason they hide the technical documentation of the chips

          What is the main reason?

          • morphle 6 days ago

            >What is the main reason?

            I can't guess what is the main reason. There might not even be a main reason, as many groups of people at Apple and its shareholders decided this over the years.

            (Also see my speculations below in this thread).

            So not in any order of importance to Apple:

            1) Create the same moat as NVIDIA has with CUDA.

            2) Ability to re-define the microcode instruction set of all the dozens of different Apple Silicon chips now and in the future without having to worry about backwards compatibility. Each Apple Silicon chip simply recompiles code at runtime (similar to my adaptive compiler).

            3) Zero hardware documentation needed, much cheaper PR and faster time to market, also making it harder to reverse engineer or repair.

            4) Security. Security by obscurity

            5) Keeping the walled garden up longer.

            6) Frustrating reverse engineering of Apple software. You must realize Apple competes with their own third party developers. Apple can optimize code on the GPU and ANE, third party developers can not and are forbidden to by Apple.

            7) Frustrating reverse engineering of Apple hardware.

            8) It won't make Apple more sales if 3rd party developers can write faster and more energy efficient GPU and NPU software.

            9) Legal and patent infringement considerations

            10) Future compiler improvements

            11) Trade secrets

            • amelius 6 days ago

              9) hiding known and/or unknown patent infringements

    • morphle 6 days ago

      There certainly is a reason and indeed you don't see it because Apple downplays these things in their PR.

      It might be the same reason that is behind NVIDIA's CUDA moat. CUDA lock-in prevented competitors like AMD and Intel from convincing programmers and their customers to switch away from CUDA. So there was no software ported to their competitive GPUs. So you get anti-trust lawsuits [1].

      I think you should put yourself in Apple's management mindset and then reason. I suspect they think they will not sell more iPhones or Macs if they let third party developers access the low level APIs and write faster software.

      They might reason that if no one knows the instruction sets hackers will write less code to break security. Security by obscurity.

      They certainly think that blocking competitors from reverse engineering the low power Apple Silicon and blocking them from using TSMC manufacturing capacity will keep them the most profitable company for another decade.

      [1] https://news.ycombinator.com/item?id=40593576

      • _zoltan_ 6 days ago

        CUDA didn't prevent anything at least not in the way you believe.

        Intel and AMD had no competitive offer, period. They still don't.

        NVIDIA is simply offering an ecosystem that is battle tested and is ready out of the box. Look at the recent semianalysis test to see how not ready AMD is, who would be the only company to have a real shot at this. Their HW on paper is better or equal, yet their software ecosystem is nowhere ready.

        • AnthonyMouse 6 days ago

          > Look at the recent semianalysis test to see how not ready AMD is, who would be the only company to have a real shot at this. Their HW on paper is better or equal, yet their software ecosystem is nowhere ready.

          Reading that was kind of odd. It seems like their conclusion was that on paper AMD should be significantly less expensive and significantly faster, whereas in practice they're significantly less expensive and slightly slower because of unoptimized software, which actually seems like it'd still be a pretty good deal. Especially if the problem is the software, because then the hardware could get better with a software update after you buy it.

          They also spend a lot of time complaining about how much trouble it is to install the experimental releases with some improvements that aren't in the stable branch yet, but then the performance difference was only big in a few cases and in general the experimental version was only a couple of percent faster, which either way should end up in the stable release in the near future.

          And they do a lot of benchmarks on interconnect bandwidth which, fair enough, Nvidia currently has some hardware advantage. But that also mainly matters to the small handful of companies doing training for huge frontier models and not to the far larger number of people doing inference or training smaller models.

          It feels like they were more frustrated because they were using the hardware as the problems were being solved rather than after, even though the software is making progress and many of the issues have already been resolved or are about to be.

          • zaroth 6 days ago

            They literally spent months trying to work out the bugs. It’s an absolute indictment of AMD’s software stack.

            Just look at their market value and it says everything you need to know about how much “better” AMD is than NVIDIA.

            • AnthonyMouse 5 days ago

              > They literally spent months trying to work out the bugs.

              That's kind of the point. They spent months working out bugs that are now worked out. Which sucks when you're the one to do it, so they're kind of bitter about it, but is pretty great for everyone who comes after them and the fixes have already made it into the drivers.

              > Just look at their market value and it says everything you need to know about how much “better” AMD is than NVIDIA.

              "The company makes more money" has a nasty tendency to be inversely correlated with value for money to the customer. Comparing the "market cap" of Oracle vs. pick your favorite open source database is not a great way to decide which one to use.

      • dylan604 6 days ago

        At this point, Apple is absolutely not afraid of an anti-trust lawsuit. To them, it is part of the cost of doing business

        • morphle 6 days ago

          I concur, they are virtually untouchable in this respect. No one else will throw a trillion or more into developing lower power faster silicon.

  • KeplerBoy 6 days ago

    Where does the 270 gbit/s networking figure come from? Is it the aggregate bandwidth from the PCIe slots on the Mac Pro? Those could support NICs at those speeds (and above, according to my quick maths#), but there is not really any driver support for modern Intel or Mellanox/Nvidia NICs as far as I can tell.

    My use case would be hooking up a device which spews out sensor data at 100 gbit/s over qsfp28 ethernet as directly to a GPU as possible. The new mac mini has the GPU power, but there's no way to get the data into it.

    # 2x Gen4x16 + 4x Gen3x8 = 2 * 31.508 GB/s + 4 * 7.877 GB/s ≈ 90 GB/s = 720 gbit/s
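
    The per-slot figures in that footnote can be reproduced from the PCIe line rates. The sketch below assumes the standard raw rates (8 GT/s per lane for Gen3, 16 GT/s for Gen4) and 128b/130b encoding, and ignores packet overhead and any lane sharing between slots; the helper name `slot_gbytes_per_s` is made up for this example.

```python
# PCIe per-lane rates: Gen3 = 8 GT/s, Gen4 = 16 GT/s, both with 128b/130b encoding.
GT_PER_S = {3: 8.0, 4: 16.0}
ENCODING = 128 / 130


def slot_gbytes_per_s(gen, lanes):
    """Usable one-direction bandwidth of a PCIe slot in GB/s, before protocol overhead."""
    return GT_PER_S[gen] * ENCODING * lanes / 8  # 8 bits per byte


# Slot mix from the footnote: 2x Gen4 x16 and 4x Gen3 x8.
slots = [(4, 16)] * 2 + [(3, 8)] * 4
total_gbytes = sum(slot_gbytes_per_s(gen, lanes) for gen, lanes in slots)
total_gbits = total_gbytes * 8

print(f"Gen4 x16: {slot_gbytes_per_s(4, 16):.3f} GB/s")
print(f"Gen3 x8:  {slot_gbytes_per_s(3, 8):.3f} GB/s")
print(f"Total:    {total_gbytes:.1f} GB/s = {total_gbits:.0f} Gbit/s")
```

    Without rounding, the sum comes to about 94.5 GB/s, or roughly 756 Gbit/s, slightly above the rounded 90 GB/s and 720 gbit/s quoted in the footnote.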

    • morphle 6 days ago

      > Where does the 270 gbit/s networking figure come from? Is it the aggregate bandwidth from the pcie slots on the Mac pro

      We both should restate and specify the calculation for each different Apple Silicon chip and the PCB/machine model it is wired onto.

      The $599 M4 Mac mini base model networking (aggregated Wifi, USB-C, 10G Ethernet, Thunderbolt PCIe) is almost 270 Gbps. Your 720 Gbps is for a >$8000 Mac Pro M2 Ultra but the number is too high because the 2x Gen4x16 is shared/oversubscribed with the other PCIe lanes for x8 PCIe slots, SSD and Thunderbolt. You need to measure/benchmark it, not read the marketing PR.

      I estimate the $1400 M4 Pro Mac mini networking bandwidth by adding the external WiFi, 10 Gbps Ethernet, two USB-C ports (2 x 10 Gbps) and three Thunderbolt 5 ports (3 x 80/120 Gbps) but subtracting the PCIe 64 Gbps limit and not counting the internal SSD. Two $599 M4 Mac mini base models are faster and cheaper than one M4 Pro Mac mini.

      The point of the precise actual measurements I did of the trillions of operations per second and the billions of bits per second of networking/interconnect of the M4 Mac mini against all the other Apple silicon machines is to find which package (chip plus PCB plus case) has the best price/performance/watt balanced against them networked together. In January 2025 you can build the cheapest fastest supercomputer in the world from just off the shelf M4 16GB Mac mini base models with 10G Ethernet, MikroTik 100G switches and a few FPGAs. It would outperform all Nvidia, Cerebras, Tenstorrent and datacenter clusters I know of, mainly because of the low power Apple Silicon.

      Note that the M4 has only 1.2 Tbps unified memory bandwidth and the M4 Pro has double that. The 8 Tbps unified memory bandwidth is on the M1 and M2 Studio Ultra with 64/128/192GB DRAM. Without it you can't reach 50 trillion operations per second. A Mac Studio has only around 190 Gbps external networking bandwidth but does not reach 43 TOPS, as does the 720 Gbps of your Mac Pro estimate. By reverse engineering the instruction set you could squeeze a few percent extra performance out of this M4 cluster.

      The 43 TOPS of the M4 itself is an estimate. The ANE does 34 TOPS, the CPU less than 5 TOPS depending on float type and we have no reliable benchmarks for the CPU floating point.

      • _zoltan_ 6 days ago

        It's very weird to add together all kinds of very different networking solutions (WiFi, wired ethernet, TB) and talk about their aggregate potential bandwidth as a single number.

        • TimSchumann 6 days ago

          Adding together all the different standards/feature sets a chip supports and then aggregating the bandwidth into a single number is actually a very reasonable way to arrive at an approximation for total chip computational throughput.

          Ultimately, unless the chip architecture is oversubscribed or overloaded (unsure what the right term is), the features are all meant to be used simultaneously and thus the bits being read/written have to come from somewhere.

          That somewhere is a % of the total throughput of the chip.

          Stated another way — people forget that there’s almost always a single piece of silicon backing the total bandwidth throughput of modern computing devices regardless of what ‘standard’ is being used.

      • KeplerBoy 6 days ago

        The PCIe configuration was taken from the Mac Pro and its M2 Ultra. https://www.apple.com/mac-pro/

        I'd assume the mac mini has a less extensive pcie/tb subsystem.

        No idea what people are doing with all those pcie slots except for nvme cards. I wonder how hard it would be to talk to a pcie fpga.

        • morphle 6 days ago

          You use SerDes high speed serial links (up to 224 Gbps in 2025) to communicate between chips. A PCIe lane is just a SerDes with a 30% packet protocol overhead that uses DMA to copy bytes between two SRAM or DRAM buffers.

          You aggregate PCIe lanes (x16, x8, x4/Thunderbolt, x1). You could also build mesh networks from SerDes but now instead of PCIe switches you would need SerDes switches or routers (Ethernet, NVLink, InfiniBand).

          You need those high speed links between chips for much more than SSD/NVMe cards. Other NAS, Processors, Ethernet/internet, Camera, Wifi, Optics, DRAM, SRAM, power etc. For intercore communication (between processors or between chiplets), between networked PCBs, between DRAM chips (DDR5 is just another SerDes protocol), Flash Chips, camera chips, etc. Any other chip at faster than 250 Mbps speeds.

          I aggregate all the M4 Mac mini ports into a M4 cluster by mesh networking all its SerDes/PCIe with FPGAs into a very cheap low power supercomputer with exaflop performance. Cheaper than NVIDIA. I'm sure Apple does the same in their data centers.

          My talk [1] on Wafer Scale Integration and free space optics goes deeper into how and why SerDes and PCIe will be replaced by fiber optics and free space optics for power reasons. I'm sure several parallel 2 GHz optic lambdas per fiber (but no SerDes!) will be the next step in Apple Silicon as well: the M4 power budget already is mostly in the off-chip SerDes/Thunderbolt networking links.

          [1] https://vimeo.com/731037615

          • KeplerBoy 6 days ago

            > I aggregate all the M4 Mac mini ports into a M4 cluster by mesh networking all its Serdes/PCIe with FPGAs into a very cheap low power supercomputer with exaflop performance. Cheaper than NVDIA. I'm sure Apple does the same in their data centers.

            That sounds super interesting, do you happen to have some further information on that? Is it just a bunch of FPGAs issuing DMA TLPs?

            • ricktdotorg 6 days ago

              sounds (at least at a high level) similar to EXO[1]

              [1] https://github.com/exo-explore/exo

              • morphle 6 days ago

                Here is a video of testing Exo to run huge LLMs on a cluster of M4 Macs[1] more cheaply than with a cluster of NVIDIA RTX 4090s.

                [1] https://www.youtube.com/watch?v=GBR6pHZ68Ho

                • menaerus 5 days ago

                  They show a test-run of a 1B llama-3.2 model. Doesn't that fit in a single mac? Distributing the workload in this case must be slower than running it on a single machine.

                  However, this is interesting and I'm confused why they aren't showcasing the test-run of a larger model that actually necessitates distributing the workload across the cluster.
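
                  A back-of-the-envelope check of the "fits in a single mac" point, as a sketch only: it counts weight storage alone (no KV cache or activations) and assumes exactly 1e9 parameters and a 16 GB machine. The function name `weights_gb` and the precision list are illustrative choices, not anything from the video.

```python
def weights_gb(num_params, bytes_per_param):
    """Approximate size of the model weights in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Assumptions: 1e9 parameters, 16 GB of unified memory on one machine.
NUM_PARAMS = 1_000_000_000
MEMORY_GB = 16

for label, nbytes in [("fp32", 4), ("fp16", 2), ("4-bit", 0.5)]:
    size = weights_gb(NUM_PARAMS, nbytes)
    print(f"{label}: {size:.1f} GB, fits in {MEMORY_GB} GB: {size < MEMORY_GB}")
```

                  Even at full fp32 precision the weights of a 1B-parameter model take about 4 GB under these assumptions, so a single machine is enough and a cluster only adds communication cost.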

            • morphle 6 days ago

              It is not the first time they built super computers from off the shelf Apple machines [1].

              M4 supercomputers are cheaper and will also have lower Capex and Opex than most datacenter hardware.

              >do you happen to have some further information on that?

              Yes, the information is in my highly detailed custom documentation for the programmers and buyers of 'my' Apple Silicon super computer, Squeak and Ometa DSL programming languages and adaptive compiler. You can contact me for this highly technical report and several scientific papers (email in my profile).

              Do you know of people who might buy a super computer based on better specifications? Or even just buyers who will go for 'the lowest Capex and the lowest Opex supercomputer in 2025-2027'?

              Because the problem with HPC is that almost all funders and managers buy supercomputers with a safe brand name (Nvidia, AMD, Intel) at triple the cost and seldom from a super computer researcher such as myself. But some do, if they understand why. I have been designing, selling, programming and operating super computers since 1984 (I was 20 years old then), this M4 Apple Silicon Cluster will be my ninth supercomputer. I prefer to build them from the ground up with our own chip and wafer scale integration designs but when an off-the-shelf chip is good enough I'll sell that instead. Price/Performance/Watt is what counts, ease of programming is a secondary consideration for what performance you achieve. Alan Kay argues you should rewrite your software from scratch [2] and do your own hardware [3] so that is what I've done since I learned from him.

              >Is it just a bunch of FPGAs issuing DMA TLPs?

              No. The FPGAs are optional for when you want to flatten the inter-core (=inter-SRAM cache) networking with switches or routers to a shorter hop topology for the message passing like a Slim Fly diameter two hop topology [4].

              DMA (Direct Memory Access) TLPs (Transaction Layer Packets) are one of the worst ways of doing inter-core and inter-SRAM communication and on PCIe it has a huge 30% protocol overhead at triple the cost. Intel (and most other chip companies like NVIDIA, Altera, AMD/XILINX) can't design proper chips because they don't want to learn about software [2]. Apple Silicon is marginally better.

              You should use pure message passing between any process, preferably in a programming language and a VM that uses pure message passing at the lowest level (Squeak, Erlang). Even better if you then map those software messages directly to message passing hardware as in my custom chips [3].

              The reasons to reverse engineer Apple Silicon instructions for CPU, GPU and ANE are to be able to adapt my adaptive compiler to M4 chips but also to repurpose PCIe for low level message passing with much better performance and latency than DMA TLPs.

              To conclude, if you want to get the cheapest Capex and Opex M4 Mac mini supercomputer you need to rewrite your supercomputing software in a high level language and message passing system like the parallel Squeak Smalltalk VM [3] with adaptive load balancing compilation. C, C++, Swift, MPI or CUDA would result in sub-optimal software performance and orders of magnitude more lines of code when optimal performance of parallel software is the goal.

              [1] https://en.wikipedia.org/wiki/System_X_(supercomputer)

              [2] https://www.youtube.com/watch?v=ubaX1Smg6pY

              [3] https://vimeo.com/731037615

              [4] https://www.youtube.com/watch?v=rLjMrIWHsxs

    • morphle 6 days ago

      >but there's no way to get the data into it at 100 Gbps

      I'm confident you can get 100 Gbps in by aggregating M4 Mac mini ports.

      I resell a $199 MikroTik CCR2004-1G-2XS-PCIe SmartNIC with 2 x 25 Gbps SFP28 that connects to a x8 PCIe 3.0. (I still have a few available for $140 plus shipping plus a few refurbished 16 x 10 Gbps for $400 and 8 x 100 Gbps switches for $800).

      Theoretically you can connect that SmartNIC to two of the three M4 Mac mini Thunderbolt 4/USB4 ports that pass through 2 x x4 PCIe 3.0, if you can figure out how to aggregate the two x4 PCIe lanes into a single x8 port. The driver source code is for Linux and could be ported to MacOS. You then aggregate the ports with the 100 Gbps switch.

      I'm pretty sure you could create a new PCB design with a larger Broadcom switch chip model to attach to the 10G Ethernet, two 10 Gbps USB-C ports plus the three Thunderbolt 4/USB4 ports and write a new driver to aggregate over the 6 ports. You'd have 126 Gbps minus the PCIe overhead and could combine it into a single 100 Gbps QSFP28 port.
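
      One way to arrive at the 126 Gbps figure, as a sketch: it assumes each Thunderbolt 4/USB4 port contributes its PCIe 3.0 x4 tunnel (4 lanes x 8 GT/s = 32 Gbps) rather than the 40 Gbps link rate, and it ignores all protocol overhead. The dictionary layout and names are illustrative only.

```python
# Nominal per-port rates in Gbps. The 32 Gbps Thunderbolt figure is an
# assumption: the PCIe 3.0 x4 tunnel (4 lanes x 8 GT/s), not the 40 Gbps link rate.
ports_gbps = {
    "10G Ethernet": [10],
    "USB-C": [10, 10],
    "Thunderbolt 4/USB4 (PCIe x4)": [32, 32, 32],
}

num_ports = sum(len(rates) for rates in ports_gbps.values())
total_gbps = sum(sum(rates) for rates in ports_gbps.values())
print(f"{total_gbps} Gbps aggregate across {num_ports} ports")
```

      Under those assumptions the six ports add up to 10 + 20 + 96 = 126 Gbps before any overhead is subtracted.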

      I already warned this is still theoretical. Broadcom might not sell you the switch chip, Intel might not sell you the Thunderbolt chip and Apple might block the installation of your device driver code.

      But people already proved the interconnect with the Apple Thunderbolt Bridge driver at 3 x 10 Gbps connected via large expensive Thunderbolt hubs [2]. Others just connect each port to different M4 Macs [1][3][4] in various ways.

      [1] https://x.com/alexocheema/status/1807882764261417000

      [2] https://www.youtube.com/watch?v=GBR6pHZ68Ho

      [3] https://www.youtube.com/watch?v=2eNVV0ouBxg

      [4] https://www.youtube.com/watch?v=SkmrUWyZThQ

      • KeplerBoy 6 days ago

        I know this should all be possible, but it isn't really, because Apple doesn't care about this use case.

        I'll just stick to Orin AGX modules with their proper PCIe slot and real Linux support. I want to do something meaningful with the incoming data and not waste years just getting the link up and running.

barkingcat 6 days ago

There is no general purpose GPU development on Apple M series.

There is Metal development. You want to learn Apple M-series GPU and GPGPU development? Learn Metal!

https://developer.apple.com/metal/

  • kristianp 6 days ago

    > There is no general purpose GPU

    That's what GPGPU stands for. So your 2 sentences contradict each other.

rgovostes 6 days ago

It's hard to answer not knowing exactly what your aim is, or your experience level with CUDA and how easily the concepts you know will map to Metal, and what you find "restricted and convoluted" about the documentation.

<Insert your favorite LLM> helped me write some simple Metal-accelerated code by scaffolding the compute pipeline, which took most of the nuisance out of learning the API and let me focus on writing the kernel code.

Here's the code if it's helpful at all. https://github.com/rgov/thps-crack

  • nixpulvis 6 days ago

    2024 and still finding cheat codes in Tony Hawk Pro Skater 2. Wild!

    • selimthegrim 6 days ago

      If Jamie Kennedy is reading this, we still haven’t found the cheat code to make you funny.

billti 6 days ago

If you know CUDA, then I assume you already know a bit about GPUs and the major concepts. There are just minor differences and different terminology for things like “warps” etc.
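
As a rough illustration of that terminology shift, here is a small lookup table from CUDA names to their closest Metal Shading Language counterparts, plus the index arithmetic both models share. The table is a sketch, not an exhaustive or official mapping, and `global_thread_index` is a hypothetical helper written only to show the formula.

```python
# CUDA term -> closest Metal Shading Language term (illustrative, not exhaustive).
CUDA_TO_METAL = {
    "warp": "SIMD-group",
    "thread block": "threadgroup",
    "__shared__ memory": "threadgroup memory",
    "__syncthreads()": "threadgroup_barrier(mem_flags::mem_threadgroup)",
    "threadIdx": "thread_position_in_threadgroup",
    "blockIdx": "threadgroup_position_in_grid",
    "blockDim": "threads_per_threadgroup",
    "gridDim": "threadgroups_per_grid",
}


def global_thread_index(group_pos, group_size, local_pos):
    """1D global index of a thread.

    In CUDA this is blockIdx.x * blockDim.x + threadIdx.x; Metal hands the
    same value to the kernel directly as thread_position_in_grid.
    """
    return group_pos * group_size + local_pos


if __name__ == "__main__":
    for cuda_name, metal_name in CUDA_TO_METAL.items():
        print(f"{cuda_name:20} -> {metal_name}")
    print(global_thread_index(group_pos=2, group_size=256, local_pos=5))
```

The main practical difference is that a Metal kernel can ask for the global position as a function argument instead of computing it by hand.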

With that base, I’ve found their docs decent enough, especially coupled with the Metal Shading Language PDF they provide (https://developer.apple.com/metal/Metal-Shading-Language-Spe...), and quite a few code samples you can download from the docs site (e.g. https://developer.apple.com/documentation/metal/performing_c...).

I’d note a lot of their stuff is still written in Objective-C, which I’m not that familiar with. But most of that is boilerplate and the rest is largely C/C++ based (including the Metal Shading Language).

I just ported some CPU/SIMD number crunching (complex matrices) to Metal, and the speedup has been staggering. What used to take days now takes minutes. It is the hottest my M3 MacBook has ever been though! (See https://x.com/billticehurst/status/1871375773413876089 :-)

dylanowen 6 days ago

People have already mentioned Metal, but if you want cross-platform, https://github.com/gfx-rs/wgpu has a Vulkan-like API and cross-compiles to all the various GPU frameworks. I believe it uses https://github.com/KhronosGroup/MoltenVK to run on Macs. You can also see the Metal shader transpilation results for debugging.

  • rudedogg 6 days ago

    Given what the OP asked for, I don't think wgpu is the right choice. They want to push the limits of Apple Silicon, or do Apple platform-specific work, so an abstraction layer like wgpu is going in the opposite direction in my opinion.

    Metal and Apple's docs are the place to start.

    • PittleyDunkin 6 days ago

      Indeed. I'm curious how much overhead there is in practice, given that the hardware wasn't designed to provide Vulkan support. I honestly have no clue what to expect.

  • grovesNL 6 days ago

    wgpu has its own Metal backend that most people use by default (not MoltenVK).

    There is also a Vulkan backend if you want to run Vulkan through MoltenVK though.

    • dylanowen 6 days ago

      Oh, good to know! It's been a while since I've looked at the macOS implementation.

    • tehsauce 6 days ago

      The Metal backend does currently generate quite a lot of unnecessary command buffers, but in general performance seems solid.

feznyng 6 days ago

Besides the official docs, you can check out llama.cpp as an example that uses Metal for accelerated inference on Apple silicon.

rowanG077 6 days ago

If you are open to running Linux, you can use standard OpenCL and Vulkan.

TriangleEdge 6 days ago

Why not OpenCL or OpenGL? You'll not be constrained by the flavor of GPU.

  • nox101 6 days ago

    Sounds like you've never actually tried running those two APIs across platforms?

    If you want portable, use WebGPU, either via wgpu for Rust or Dawn for C++. They actually do run on Windows, Linux, Mac, iOS, and Android portably.

    • thrtythreeforty 6 days ago

      wgpu Just Works from C++ as well. Both projects implement the webgpu.h API.

amelius 6 days ago

Apple is known to actively discourage general purpose computing. Better try a different vendor.

  • saagarjha 6 days ago

    idk about “known” considering they basically created OpenGL

    • talldayo 5 days ago

      If OpenGL is your most up-to-date reference for Apple supporting general purpose computing then I think it absolutely emphasizes how little work they've put in.

      • saagarjha 5 days ago

        Metal is not particularly bad.

        • talldayo 4 days ago

          Metal is neither a general-purpose compute API nor a known example of Apple supporting it. If you consider Metal "general purpose" then so is DirectX or even PlayStation and Nintendo shaders, simply because you can sum matrices.

          Accelerate framework might be what you're looking for here, but by most accounts that hasn't improved very much recently. And even still, Accelerate is no analog for OpenCL or CUDA.

          • saagarjha 4 days ago

            Metal is basically equivalently powerful to CUDA, so I'm not entirely sure what you're getting at here. I mean, it's literally Apple's CUDA but for their own hardware.

    • mixmastamyk 6 days ago

      That was SGI.

      • astrange 6 days ago

        I think that's a typo. Apple created OpenCL.

        OpenGL then added compute shaders to make twice the implementation cost for the same feature.

        • saagarjha 5 days ago

          Yeah autocorrect doesn’t like me this week

  • codr7 6 days ago

    Preferably one that sells computers, not fashion statements.

    • likeabbas 6 days ago

      It's not a fashion statement, it's a fucking deathwish