Background
I am optimizing a function that reads from a Real-Time Clock (RTC) structure and stores the values into an array of 8-bit bytes.
The original function used direct assignments, for which the compiler generated alternating load and store instructions:
struct Time_t
{
    uint32_t year;
    uint32_t month;
    uint32_t day;
    uint32_t hours;
    uint32_t minutes;
    uint32_t seconds;
};

void F1_Time_To_Buffer(uint8_t * p_buffer)
{
    Time_t date_time;
    (void)Get_RTC_Time(&date_time); // Discard return code
    p_buffer[0] = (uint8_t) (date_time.year - 2000U);
    p_buffer[1] = (uint8_t) date_time.month;
    p_buffer[2] = (uint8_t) date_time.day;
    p_buffer[3] = (uint8_t) date_time.hours;
    p_buffer[4] = (uint8_t) date_time.minutes;
    p_buffer[5] = (uint8_t) date_time.seconds;
};
For each statement, the compiler generates a load from memory followed by a store to memory:
void F1_Time_To_Buffer(uint8_t * p_buffer)
{
\ F1_Time_To_Buffer: (+1)
\ 0x0 0xB518 PUSH {R3,R4,LR}
\ 0x2 0xB087 SUB SP,SP,#+28
\ 0x4 0x0004 MOVS R4,R0
Time_t date_time;
(void)Get_RTC_Time(&date_time); //Discard return code
\ 0x6 0x4668 MOV R0,SP
\ 0x8 0x....'.... BL Get_RTC_Time
p_buffer[0] = (uint8_t) (date_time.year - 2000U);
\ 0xC 0x9806 LDR R0,[SP, #+24]
\ 0xE 0x3030 ADDS R0,R0,#+48
\ 0x10 0x7020 STRB R0,[R4, #+0]
p_buffer[1] = (uint8_t) date_time.month;
\ 0x12 0x9805 LDR R0,[SP, #+20]
\ 0x14 0x7060 STRB R0,[R4, #+1]
p_buffer[2] = (uint8_t) date_time.day;
\ 0x16 0x9804 LDR R0,[SP, #+16]
\ 0x18 0x70A0 STRB R0,[R4, #+2]
p_buffer[3] = (uint8_t) date_time.hours;
\ 0x1A 0x9803 LDR R0,[SP, #+12]
\ 0x1C 0x70E0 STRB R0,[R4, #+3]
p_buffer[4] = (uint8_t) date_time.minutes;
\ 0x1E 0x9802 LDR R0,[SP, #+8]
\ 0x20 0x7120 STRB R0,[R4, #+4]
p_buffer[5] = (uint8_t) date_time.seconds;
\ 0x22 0x9801 LDR R0,[SP, #+4]
\ 0x24 0x7160 STRB R0,[R4, #+5]
};
\ 0x26 0xB008 ADD SP,SP,#+32
\ 0x28 0xBD10 POP {R4,PC} ;; return
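A side note on the listing above: the source subtracts 2000U, yet the compiler emits ADDS R0,R0,#+48. Only the low 8 bits survive the STRB, and -2000 is congruent to 48 modulo 256, so the two forms give the same byte. Here is a minimal sketch that checks this; the helper names are hypothetical and are not part of the original code.

```cpp
#include <cstdint>

// Hypothetical helper names, used only for this illustration.

// What the source says: subtract 2000, then keep the low 8 bits.
constexpr std::uint8_t year_via_subtract(std::uint32_t year)
{
    return static_cast<std::uint8_t>(year - 2000U);
}

// What the listing does: add 48, then store only the low byte (STRB).
constexpr std::uint8_t year_via_add48(std::uint32_t year)
{
    return static_cast<std::uint8_t>(year + 48U);
}

// (0 - 2000) mod 256 == 48, so both forms agree for every input.
static_assert(static_cast<std::uint8_t>(0U - 2000U) == 48U, "-2000 is 48 mod 256");
static_assert(year_via_subtract(2024U) == 24U, "2024 is stored as 24");
static_assert(year_via_add48(2024U) == 24U, "same byte via ADDS #48");
```

The extra instruction is therefore part of the year conversion, not an artifact of the load/store ordering.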
The optimized version performs all the loads from memory as one block, then all the stores to memory as another block:
void F2_Time_To_Buffer(uint8_t * p_buffer)
{
\ F2_Time_To_Buffer: (+1)
\ 0x0 0xB578 PUSH {R3-R6,LR}
\ 0x2 0xB087 SUB SP,SP,#+28
\ 0x4 0x0004 MOVS R4,R0
Time_t date_time;
(void)Get_RTC_Time(&date_time); //Discard return code
\ 0x6 0x4668 MOV R0,SP
\ 0x8 0x....'.... BL Get_RTC_Time
const uint8_t year = (uint8_t) (date_time.year - 2000U);
\ 0xC 0x9806 LDR R0,[SP, #+24]
\ 0xE 0x3030 ADDS R0,R0,#+48
const uint8_t month = (uint8_t) date_time.month;
\ 0x10 0x9905 LDR R1,[SP, #+20]
const uint8_t day = (uint8_t) date_time.day;
\ 0x12 0x9A04 LDR R2,[SP, #+16]
const uint8_t hours = (uint8_t) date_time.hours;
\ 0x14 0x9B03 LDR R3,[SP, #+12]
const uint8_t minutes = (uint8_t) date_time.minutes;
\ 0x16 0x9D02 LDR R5,[SP, #+8]
const uint8_t seconds = (uint8_t) date_time.seconds;
\ 0x18 0x9E01 LDR R6,[SP, #+4]
p_buffer[0] = year;
\ 0x1A 0x7020 STRB R0,[R4, #+0]
p_buffer[1] = month;
\ 0x1C 0x7061 STRB R1,[R4, #+1]
p_buffer[2] = day;
\ 0x1E 0x70A2 STRB R2,[R4, #+2]
p_buffer[3] = hours;
\ 0x20 0x70E3 STRB R3,[R4, #+3]
p_buffer[4] = minutes;
\ 0x22 0x7125 STRB R5,[R4, #+4]
p_buffer[5] = seconds;
\ 0x24 0x7166 STRB R6,[R4, #+5]
}
\ 0x26 0xB008 ADD SP,SP,#+32
\ 0x28 0xBD70 POP {R4-R6,PC} ;; return
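For readability, here is the F2 source pulled out of the listing above, as a sketch. The Get_RTC_Time stub, its int return type, and the fixed timestamp are assumptions made only so the snippet compiles on a host machine; the real function reads the RTC hardware.

```cpp
#include <stdint.h>

struct Time_t
{
    uint32_t year;
    uint32_t month;
    uint32_t day;
    uint32_t hours;
    uint32_t minutes;
    uint32_t seconds;
};

// ASSUMPTION: host-side stub standing in for the real RTC driver.
// The int return type and the fixed timestamp are hypothetical.
int Get_RTC_Time(Time_t * p_time)
{
    p_time->year    = 2024U;
    p_time->month   = 3U;
    p_time->day     = 15U;
    p_time->hours   = 13U;
    p_time->minutes = 45U;
    p_time->seconds = 30U;
    return 0;
}

void F2_Time_To_Buffer(uint8_t * p_buffer)
{
    Time_t date_time;
    (void)Get_RTC_Time(&date_time); // Discard return code

    // Load phase: read every field into a local first.
    const uint8_t year    = (uint8_t) (date_time.year - 2000U);
    const uint8_t month   = (uint8_t) date_time.month;
    const uint8_t day     = (uint8_t) date_time.day;
    const uint8_t hours   = (uint8_t) date_time.hours;
    const uint8_t minutes = (uint8_t) date_time.minutes;
    const uint8_t seconds = (uint8_t) date_time.seconds;

    // Store phase: write all bytes to the buffer afterwards.
    p_buffer[0] = year;
    p_buffer[1] = month;
    p_buffer[2] = day;
    p_buffer[3] = hours;
    p_buffer[4] = minutes;
    p_buffer[5] = seconds;
}
```

With the stubbed timestamp, a 6-byte buffer passed to F2_Time_To_Buffer ends up holding 24, 3, 15, 13, 45, 30, the same bytes F1 would produce.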
The F2 function is more efficient since it reduces the chances of data cache misses.
Question
What is the effect on efficiency of grouping the storage statements as in the F2 function?
(Explain how the data cache is involved.)
Notes
- Compiler is IAR Embedded Workbench IDE - Arm 8.42.2
- Compiled in debug mode (no or minimal optimizations), purposely, because it resides on a medical device platform.
- The Get_RTC_Time() function reads data from an RTC hardware device.
- The Time_t structure has uint32_t fields to model the hardware device.
- The buffer is uint8_t to conserve memory space (platform is memory constrained).
F1 alternates between reading date_time and writing to p_buffer. Since the accesses to each block of memory are not sequential, there is more chance for one of them to be pushed out of the cache.
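The note above can be made concrete with a toy model. The sketch below assumes a direct-mapped, write-allocate data cache with 32-byte lines. The addresses, cache geometry, and function names are all hypothetical, the field order is simplified, and none of it describes the actual target.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy direct-mapped cache: counts misses for a sequence of byte addresses.
// ASSUMPTION: stores allocate a line on a miss (write-allocate).
std::size_t count_misses(const std::vector<std::uint32_t> & addresses,
                         std::uint32_t line_size, std::uint32_t num_lines)
{
    std::vector<std::int64_t> tags(num_lines, -1); // -1 marks an empty slot
    std::size_t misses = 0;
    for (std::uint32_t addr : addresses)
    {
        const std::uint32_t line  = addr / line_size;
        const std::uint32_t index = line % num_lines;
        if (tags[index] != static_cast<std::int64_t>(line))
        {
            tags[index] = line; // evict whatever was in this slot
            ++misses;
        }
    }
    return misses;
}

// F1 order: load field i, store byte i, repeated for the six fields.
std::vector<std::uint32_t> interleaved_order(std::uint32_t src, std::uint32_t dst)
{
    std::vector<std::uint32_t> seq;
    for (std::uint32_t i = 0; i < 6U; ++i)
    {
        seq.push_back(src + 4U * i); // LDR from date_time
        seq.push_back(dst + i);      // STRB to p_buffer
    }
    return seq;
}

// F2 order: six loads, then six stores.
std::vector<std::uint32_t> grouped_order(std::uint32_t src, std::uint32_t dst)
{
    std::vector<std::uint32_t> seq;
    for (std::uint32_t i = 0; i < 6U; ++i) { seq.push_back(src + 4U * i); }
    for (std::uint32_t i = 0; i < 6U; ++i) { seq.push_back(dst + i); }
    return seq;
}
```

In this model, with 8 lines of 32 bytes, placing date_time at 0x20000000 and p_buffer at 0x20000020 gives 2 misses for either order. Moving p_buffer to 0x20000100, which maps to the same slot as date_time, gives 12 misses for the interleaved order and 2 for the grouped order. So the two orders differ here only when both regions compete for the same cache slot.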