That's normally true for most workloads and entities, but like I said, if it takes 30 days to run your workload once, and you can't speed it up with more A100s / H100s / etc. because you've hit a bottleneck someplace (look at their graph showing diminishing returns as they added GPUs to a given task), then additional performance would be worth it up to a point, even if Perf/W was worse, if you can afford it.
From Tesla's own real-world usage we can see that there are diminishing returns the more GPUs you add. While performance might go up from 64 to 128 GPUs (not indicated in the graph), clearly going higher is not really getting you much, and there's no way to significantly reduce the time it takes to run such large jobs just by adding more GPUs.
AI Day 2 @ 2:15:44
View attachment 859103
Whether the point where Perf/W becomes more important again is at a week per job, or a day per job, or an hour per job, versus the original 30 days per job, is entirely going to depend on the goals and resources (i.e. how much money to throw at it) of the entity running the job. I would guess that if Dojo was only as fast (due to bottlenecks or whatever) as their existing GPU clusters, they would not have proceeded with it. But even cutting execution time in half to ~2 weeks was clearly worth it, at least to Tesla, even with the costs of engineering Dojo and all that goes with that.
For such large jobs that take so long to run, the old saying "time is money" applies: reducing the time it takes to run the job outweighs the raw Perf/W, if the value of getting those results back faster is high enough to offset the extra costs.
Though, at a claimed potential 2x~4x the performance of an A100 die for die, depending on the workload, and roughly 1.75x~2.8x the watts per die (comparing official A100 TDPs for various models to the claimed 104kW for a 150-die 'system tray' for Dojo from the AI Day 2 presentation), and ignoring the rest of the supporting system power usage (since there's no easy way to determine that for either cluster type), Perf/W is either not much worse or is much better for Dojo anyway. With Perf/W that close, it becomes even less relevant compared to the execution speedup. If we were talking double the performance for a job that finishes in minutes, seconds, or less, then of course Perf/W is king. But going from 30 days to ~2 weeks, and potentially even faster, has significant business implications that far outweigh the traditional Perf/W implications.
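
For anyone who wants to check the arithmetic behind those ratios, here's a rough sketch. The 104kW tray figure and the 2x~4x performance claim are from the post above. The A100 TDPs of 250 W and 400 W are my assumption for the low and high ends of the "various models" range, and I'm treating the tray's 150 units as dies, since that's what the watts-per-die figure implies. It ignores supporting system power, same as above.

```python
# Back-of-the-envelope Perf/W comparison, Dojo die vs. A100 die.
# Assumptions (not measured data): A100 TDPs of 250 W and 400 W as the
# low/high ends, and 150 dies sharing the claimed 104 kW system tray.

DOJO_TRAY_WATTS = 104_000        # claimed power for one system tray
DOJO_DIES_PER_TRAY = 150         # assumed: the tray's 150 units are dies
A100_TDP_WATTS = (250, 400)      # assumed low/high A100 TDPs
PERF_RATIOS = (2.0, 4.0)         # claimed Dojo-vs-A100 speedup, die for die

dojo_watts_per_die = DOJO_TRAY_WATTS / DOJO_DIES_PER_TRAY

# How many times more power a Dojo die draws than each A100 variant.
power_ratios = [dojo_watts_per_die / tdp for tdp in A100_TDP_WATTS]

# Relative Perf/W = (perf ratio) / (power ratio); 1.0 means parity.
worst_case = min(PERF_RATIOS) / max(power_ratios)
best_case = max(PERF_RATIOS) / min(power_ratios)

print(f"Dojo watts per die: {dojo_watts_per_die:.0f} W")
print("Power ratio vs A100: "
      + ", ".join(f"{r:.2f}x" for r in sorted(power_ratios)))
print(f"Relative Perf/W: {worst_case:.2f}x (worst) to {best_case:.2f}x (best)")
```

Under those assumptions the worst case pairs the smallest speedup with the lowest-TDP A100 and the best case pairs the largest speedup with the highest-TDP A100, which is where the "not much worse or much better" range comes from.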