Overall I mostly agree. I'm not sure I would call it "efficiency," but that's just terminology. It seems like you're really talking about execution time, whether of the critical path or as some sum over parallel threads, or maybe energy consumption. Also, as far as I can see, the rule for what to optimize, work vs. step, really comes down to fitting the sub-tasks into an integral multiple of the number of cores, and, as I think you said, probably the smallest integral multiple possible.
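To make the "integral multiple of cores" point concrete, here is a toy model (my own hypothetical helper, assuming equal-cost sub-tasks scheduled greedily): the step time is set by how many full "waves" of parallel work the cores need, so one task past a multiple of the core count costs a whole extra wave.

```python
import math

def step_time(n_tasks: int, n_cores: int, task_time: float = 1.0) -> float:
    """Step time when equal-cost sub-tasks are scheduled greedily:
    each 'wave' runs up to n_cores tasks in parallel."""
    waves = math.ceil(n_tasks / n_cores)
    return waves * task_time

# With 8 cores, 16 tasks finish in 2 waves; 17 tasks need 3 waves,
# and the last wave leaves 7 of the 8 cores idle.
print(step_time(16, 8))  # 2.0
print(step_time(17, 8))  # 3.0
```

So going from 16 to 17 sub-tasks on 8 cores adds 50% to the step time even though the total work grew by only ~6%.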
There are exceptions to this too. Some processors have thermal limits, so thermal load, clock speed, and the number of usable cores can matter. There are also memory bandwidth limitations. Then there is code and code-generation optimization, and vectorization, which I'm not sure fits directly into your model very well, plus the effect of vectorization on memory bandwidth. I know that on my workstation I can often get the same step time in two ways: no vectorization using all physical cores, OR vectorization using only half of the physical cores. I would need higher memory bandwidth to actually benefit from both. That is 4-port versus the more common 2-port memory.
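That trade-off can be sketched with back-of-the-envelope roofline-style arithmetic (all numbers below are hypothetical, just illustrating the shape of the limit): once a kernel is memory-bound, total throughput is capped by bandwidth, so doubling per-core speed via vectorization and halving the core count can land on exactly the same ceiling.

```python
def achieved_rate(per_core_rate: float, n_cores: int, bw_cap: float) -> float:
    """Throughput actually achieved: compute scales with core count,
    but the memory system caps the total."""
    return min(per_core_rate * n_cores, bw_cap)

# Hypothetical numbers: a vectorized core streams data 2x as fast,
# and the memory system caps total throughput at 8 units.
scalar = achieved_rate(1.0, 8, bw_cap=8.0)  # all 8 cores, no vectorization
vector = achieved_rate(2.0, 4, bw_cap=8.0)  # half the cores, vectorized
print(scalar, vector)  # both hit the same 8.0 bandwidth cap
```

With twice the bandwidth cap, `achieved_rate(2.0, 8, bw_cap=16.0)` would actually use both vectorization and all the cores, which matches the point about needing the wider memory configuration.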
I'm not sure what I'm getting at. Maybe that I more or less agree, but at least for me, thinking specifically about what efficiency means and how a task might be mapped onto the hardware makes things clearer. Or maybe I just misunderstand the whole thing.