I think it is best to view it as a car assembly line.
The quickest way to assemble a car is for each team to do their step one after each other - a car will take the combined time of all operations, but you are only building one car at a time. Fine if you a build a single Mclaren F1 race car.
The quickest way to assemble lots of cars is a production line. Each team does their step, and then the car moves on to the next team for the next process. Making the first car takes (number of steps) * (length of longest step). But once you have your first car you get another car every (length of longest step).
But what if mounting a motor takes five minutes, and the next longest operation takes only three minutes? You can only make one car every five minutes, and the rest of the teams are idle for at least 40% of the time.
The solution might be to split mounting the engine into two steps, each taking up to three minutes. Then you can make a car every three minutes. In FPGA-speak, this is pipelining to get the highest clock speed. Big gains are easy at first, but then it gets harder and harder to get any faster, as more parts of the design become critical to the timing of the overall pipeline.
The other solution might be to combine pairs of the three minute steps so no step takes longer than 5 minutes. That way you only need half the resources, yet can still produce a car every five minutes... this is the "once you meet your FPGA's timing constraints, then re-balancing the pipeline can save you resources" strategy.