The FFT I just pointed you at *is* a radix-2 FFT. A pipelined FFT can be either radix-2 or radix-4 (or radix-8 and higher, but almost no one does that). A block FFT can likewise be either radix-2 or radix-4.
The big difference between radix-2 and radix-4 is the number of inputs (and outputs) to the butterfly. A radix-2 butterfly consumes two inputs and produces two outputs. A radix-4 butterfly consumes four inputs and produces four outputs. If you follow that math through for the first stage of an N-point FFT, using a decimation in frequency approach, a radix-4 algorithm will need to store the incoming values into memory until it has values k, k+N/4, k+N/2, and k+3*N/4, for k from 0 to N/4-1. The butterflies will then only operate for 1/4 of the time, and will need to wait for inputs the other 3/4. Similarly, the butterfly will produce four outputs at once. One of them can move on to the next stage right away, but the other three will need to go into a memory. Hence, your memory requirements for this stage will go up from N block RAM points to 2N, although this one stage will now accomplish the work of two of the radix-2 stages.
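To make that input grouping concrete, here is a rough software sketch of one radix-4 decimation in frequency stage. It is not HDL and not taken from my FFT core; the function names (`dft`, `radix4_dif_stage`, `fft_via_radix4_stage`) are made up for illustration, and it assumes N is a multiple of 4. It only shows which four samples each butterfly needs and checks the result against a direct DFT.

```python
import cmath

def dft(x):
    """Direct O(N^2) DFT, used here only as a reference."""
    n = len(x)
    return [sum(x[i] * cmath.exp(-2j * cmath.pi * i * m / n) for i in range(n))
            for m in range(n)]

def radix4_dif_stage(x):
    """One radix-4 DIF stage: returns four length-N/4 sequences.

    Butterfly k needs x[k], x[k+N/4], x[k+N/2] and x[k+3N/4], so in a
    streaming design nothing can be computed until 3N/4 samples have arrived.
    """
    n = len(x)
    assert n % 4 == 0, "sketch assumes N is a multiple of 4"
    q = n // 4
    branches = [[], [], [], []]
    for k in range(q):
        a, b, c, d = x[k], x[k + q], x[k + 2 * q], x[k + 3 * q]
        w1 = cmath.exp(-2j * cmath.pi * k / n)  # twiddle W^k
        w2 = w1 * w1                            # twiddle W^(2k)
        w3 = w2 * w1                            # twiddle W^(3k)
        # Multiplying by +/-j is only a swap and a negation in hardware,
        # so the three twiddles are the only true complex multiplies.
        branches[0].append(a + b + c + d)
        branches[1].append((a - 1j * b - c + 1j * d) * w1)
        branches[2].append((a - b + c - d) * w2)
        branches[3].append((a + 1j * b - c - 1j * d) * w3)
    return branches

def fft_via_radix4_stage(x):
    """Finish each branch with a small DFT; branch q yields bins 4r+q."""
    out = [0j] * len(x)
    for q, branch in enumerate(radix4_dif_stage(x)):
        for r, value in enumerate(dft(branch)):
            out[4 * r + q] = value
    return out
```

Calling `fft_via_radix4_stage(x)` on a 16-point input should match `dft(x)` to within rounding error. The loop over `k` is where the waiting and the extra memory show up in a streaming implementation.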
As for delays ... aside from filling memories, I'm not sure: I've never built a radix-4 FFT butterfly in HDL (yet). I'm not sure how I'd go about handling the three complex multiplies required. Right now, for my radix-2 FFT, I only have to deal with one complex multiply, which I can then turn into three real multiplies. With a radix-4 butterfly, does that mean I'd be using 12 real multiplies? Or would those 12 somehow need to be multiplexed to share DSP hardware? I'm not sure--I've never built one.
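For reference, the three-real-multiply trick for a single complex multiply can be sketched as below. This is one common arrangement written in Python rather than HDL, and the name `complex_mul3` is hypothetical; the actual core may order the terms differently. Applied to each of three twiddle multiplies, this form would cost nine real multiplies, versus twelve with the plain four-multiply form.

```python
def complex_mul3(a, b, c, d):
    """Compute (a + jb) * (c + jd) with three real multiplies.

    Returns (real, imag). The price is extra additions: three before the
    multiplies and two after them.
    """
    k1 = c * (a + b)   # multiply 1
    k2 = a * (d - c)   # multiply 2
    k3 = b * (c + d)   # multiply 3
    real = k1 - k3     # a*c - b*d
    imag = k1 + k2     # a*d + b*c
    return real, imag
```

For example, `complex_mul3(3, 4, 5, -2)` computes (3 + 4j)(5 - 2j), which is 23 + 14j.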
Normally, you just accept the delay of the FFT in your design. Why are you so concerned about the delay? May I ask what application you are working on?