I'll try to explain:
every processor has a small bunch of general purpose registers and a smaller bunch of private registers.
The number of registers is implementation dependent; just as an example, the private registers include one register for the current instruction (IR) and one for the program counter.
Every processor repeats this loop:
- fetch
- decode
- execute
- check for interrupts
Now assume a single, non-parallel processor and this code (I use D-RISC for this example):
CLEAR R2
loop: LOAD R1, R2, R3
LOAD R4, R2, R5
ADD R3, R5, R6
STORE R7, R2, R6
INCR R2
IF< R2, R8, loop
This could be the compiled form of:
for (i = 0; i < N; i++) C[i] = A[i] + B[i];
In our example we have 7 instructions, 6 of which form a loop that is executed N times,
so the total time is (1 (CLEAR) + 6 * N (loop)) * T
where T is the average time the processor takes to fetch, decode, execute and check for interrupts for one instruction.
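To make the loop and the count above concrete, here is a minimal sketch of a toy interpreter in Python. The operand semantics are my assumption (LOAD base, index, dest; STORE base, index, source; ADD a, b, dest), and the memory layout and the names run and demo are made up for the illustration.

```python
# Toy interpreter for the D-RISC fragment above (operand semantics are assumed).
PROGRAM = [
    ("CLEAR", "R2"),
    ("LOAD", "R1", "R2", "R3"),   # R3 = mem[R1 + R2]
    ("LOAD", "R4", "R2", "R5"),   # R5 = mem[R4 + R2]
    ("ADD", "R3", "R5", "R6"),    # R6 = R3 + R5
    ("STORE", "R7", "R2", "R6"),  # mem[R7 + R2] = R6
    ("INCR", "R2"),
    ("IF<", "R2", "R8", 1),       # jump to index 1 (label "loop") if R2 < R8
]

def run(program, regs, mem):
    """Run until the program counter passes the end; return instructions executed."""
    pc = 0                          # program counter (a private register)
    executed = 0
    while pc < len(program):
        ir = program[pc]            # fetch into the instruction register
        op, *args = ir              # decode
        pc += 1
        if op == "CLEAR":           # execute
            regs[args[0]] = 0
        elif op == "LOAD":
            base, index, dest = args
            regs[dest] = mem[regs[base] + regs[index]]
        elif op == "STORE":
            base, index, src = args
            mem[regs[base] + regs[index]] = regs[src]
        elif op == "ADD":
            a, b, dest = args
            regs[dest] = regs[a] + regs[b]
        elif op == "INCR":
            regs[args[0]] += 1
        elif op == "IF<":
            a, b, target = args
            if regs[a] < regs[b]:
                pc = target
        else:
            raise ValueError(f"unknown instruction {op}")
        executed += 1
        # a real processor would check for pending interrupts here
    return executed

def demo(n=4):
    """Hypothetical layout: A at address 0, B at n, C at 2n, N in R8."""
    a = list(range(n))
    b = [10 * x for x in range(n)]
    mem = a + b + [0] * n
    regs = {f"R{i}": 0 for i in range(1, 9)}
    regs.update(R1=0, R4=n, R7=2 * n, R8=n)
    executed = run(PROGRAM, regs, mem)
    return executed, mem[2 * n:]
```

With n = 4, demo returns 25 executed instructions (1 + 6 * 4) and C = [0, 11, 22, 33]. Note that the assembly tests the condition at the end of the loop, so the body runs at least once and the count 1 + 6 * N only holds for N >= 1.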
Every time the processor wants to fetch an instruction, it sends a request to a module (the MMU) that translates the logical address into a physical address.
The MMU talks to the L1 cache and returns the instruction to the processor (if there is no fault; at this point I would have to talk about paged memory, segmented memory, TLBs, etc., which is too long...).
processor --> MMU --> TLB --> L1 --> ...
This chain is followed on every fetch (and on any further memory accesses, for memory-memory or register-memory instructions).
Once the processor has the instruction (in the IR register), it can decode and execute it.
Every instruction (or class of instructions) has a different T (a short register-register arithmetic operation can take 1 clock cycle to execute, etc.).
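A very rough sketch of the translation step in Python, only to show the idea. It assumes paged memory with a 4096-byte page size, and it models both the TLB and the page table as plain dictionaries; the class MMU and its fields are invented for the example, and a real TLB is a small hardware cache with a replacement policy, which is left out here.

```python
PAGE_SIZE = 4096  # assumed page size in bytes

class MMU:
    """Toy MMU: a TLB (small cache) in front of a page table."""

    def __init__(self, page_table):
        self.page_table = page_table   # logical page number -> physical frame number
        self.tlb = {}                  # recently used translations (no eviction here)
        self.tlb_hits = 0
        self.tlb_misses = 0

    def translate(self, logical_addr):
        """Return the physical address for a logical address."""
        page, offset = divmod(logical_addr, PAGE_SIZE)
        if page in self.tlb:
            self.tlb_hits += 1
        else:
            self.tlb_misses += 1
            if page not in self.page_table:
                # the page is not mapped: the operating system would handle this
                raise LookupError(f"page fault on page {page}")
            self.tlb[page] = self.page_table[page]
        return self.tlb[page] * PAGE_SIZE + offset
```

For example, with MMU({0: 5, 1: 2}), translating logical address 8 gives 5 * 4096 + 8 = 20488, and a second access to the same page is a TLB hit. The physical address would then go to the L1 cache.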
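If there are cache faults (misses), we have (1 + 6*N) * Tex + Nfault * Tfault, where Tex is the average time per instruction without faults, Nfault is the number of faults and Tfault is the time needed to handle one fault.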
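The formula with faults can be written as a small Python function to play with the numbers. The function name and the values in the example are invented for illustration; they are not measurements.

```python
def total_time(n, t_ex, n_fault, t_fault):
    """Estimated time for the example: (1 + 6*N) * Tex + Nfault * Tfault."""
    return (1 + 6 * n) * t_ex + n_fault * t_fault
```

For example, with N = 1000, Tex = 2 clock cycles, Nfault = 300 and Tfault = 50 clock cycles, total_time(1000, 2, 300, 50) gives 12002 + 15000 = 27002 cycles, so in this made-up case the faults cost more than the instructions themselves.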
I have written enough; for pipelined processors, multicore, or separate instruction and data caches, you have to understand this first.
These are the concepts as I know them. It is probably only textbook material, but it is the best I can do.
If someone has less textbook and more real-world experience, please correct me.
As always, sorry for my English.