
Background

I am optimizing a function that reads the Real-Time Clock (RTC) into a structure and stores the values into an array of 8-bit bytes.

The original function used direct assignments, for which the compiler generated alternating load and store instructions:

#include <stdint.h>

struct Time_t
{
  uint32_t year;
  uint32_t month;
  uint32_t day;
  uint32_t hours;
  uint32_t minutes;
  uint32_t seconds;
};

int Get_RTC_Time(Time_t * p_time); // Reads from the RTC hardware device; return type assumed here

void F1_Time_To_Buffer(uint8_t * p_buffer)
{
    Time_t date_time;
    (void)Get_RTC_Time(&date_time); // Discard return code
    p_buffer[0] = (uint8_t) (date_time.year - 2000U);
    p_buffer[1] = (uint8_t) date_time.month;
    p_buffer[2] = (uint8_t) date_time.day;
    p_buffer[3] = (uint8_t) date_time.hours;
    p_buffer[4] = (uint8_t) date_time.minutes;
    p_buffer[5] = (uint8_t) date_time.seconds;
}

For each statement, the compiler generates a load from memory immediately followed by a store to memory:

                  void F1_Time_To_Buffer(uint8_t * p_buffer)
                  {
   \                     F1_Time_To_Buffer: (+1)
   \        0x0   0xB518             PUSH     {R3,R4,LR}
   \        0x2   0xB087             SUB      SP,SP,#+28
   \        0x4   0x0004             MOVS     R4,R0
                      Time_t date_time;
                     (void)Get_RTC_Time(&date_time); //Discard return code
   \        0x6   0x4668             MOV      R0,SP
   \        0x8   0x....'....        BL       Get_RTC_Time
                      
                      p_buffer[0] = (uint8_t) (date_time.year - 2000U);
   \        0xC   0x9806             LDR      R0,[SP, #+24]
   \        0xE   0x3030             ADDS     R0,R0,#+48
   \       0x10   0x7020             STRB     R0,[R4, #+0]
                      p_buffer[1] = (uint8_t) date_time.month;
   \       0x12   0x9805             LDR      R0,[SP, #+20]
   \       0x14   0x7060             STRB     R0,[R4, #+1]
                      p_buffer[2] = (uint8_t) date_time.day;
   \       0x16   0x9804             LDR      R0,[SP, #+16]
   \       0x18   0x70A0             STRB     R0,[R4, #+2]
                      p_buffer[3] = (uint8_t) date_time.hours;
   \       0x1A   0x9803             LDR      R0,[SP, #+12]
   \       0x1C   0x70E0             STRB     R0,[R4, #+3]
                      p_buffer[4] = (uint8_t) date_time.minutes;
   \       0x1E   0x9802             LDR      R0,[SP, #+8]
   \       0x20   0x7120             STRB     R0,[R4, #+4]
                      p_buffer[5] = (uint8_t) date_time.seconds;
   \       0x22   0x9801             LDR      R0,[SP, #+4]
   \       0x24   0x7160             STRB     R0,[R4, #+5]
                  };
   \       0x26   0xB008             ADD      SP,SP,#+32
   \       0x28   0xBD10             POP      {R4,PC}          ;; return

The optimized version performs all of the loads from memory as one block, then all of the stores to memory as another block:

                  void F2_Time_To_Buffer(uint8_t * p_buffer)
                  {
   \                     F2_Time_To_Buffer: (+1)
   \        0x0   0xB578             PUSH     {R3-R6,LR}
   \        0x2   0xB087             SUB      SP,SP,#+28
   \        0x4   0x0004             MOVS     R4,R0
                      Time_t date_time;
                      (void)Get_RTC_Time(&date_time); //Discard return code
   \        0x6   0x4668             MOV      R0,SP
   \        0x8   0x....'....        BL       Get_RTC_Time
                      const uint8_t   year        = (uint8_t) (date_time.year - 2000U);
   \        0xC   0x9806             LDR      R0,[SP, #+24]
   \        0xE   0x3030             ADDS     R0,R0,#+48
                      const uint8_t   month       = (uint8_t) date_time.month;
   \       0x10   0x9905             LDR      R1,[SP, #+20]
                      const uint8_t   day         = (uint8_t) date_time.day;
   \       0x12   0x9A04             LDR      R2,[SP, #+16]
                      const uint8_t   hours       = (uint8_t) date_time.hours;
   \       0x14   0x9B03             LDR      R3,[SP, #+12]
                      const uint8_t   minutes     = (uint8_t) date_time.minutes;
   \       0x16   0x9D02             LDR      R5,[SP, #+8]
                      const uint8_t   seconds     = (uint8_t) date_time.seconds;
   \       0x18   0x9E01             LDR      R6,[SP, #+4]
                      p_buffer[0] =   year;
   \       0x1A   0x7020             STRB     R0,[R4, #+0]
                      p_buffer[1] =   month;
   \       0x1C   0x7061             STRB     R1,[R4, #+1]
                      p_buffer[2] =   day;
   \       0x1E   0x70A2             STRB     R2,[R4, #+2]
                      p_buffer[3] =   hours;
   \       0x20   0x70E3             STRB     R3,[R4, #+3]
                      p_buffer[4] =   minutes;
   \       0x22   0x7125             STRB     R5,[R4, #+4]
                      p_buffer[5] =   seconds;
   \       0x24   0x7166             STRB     R6,[R4, #+5]
                  }
   \       0x26   0xB008             ADD      SP,SP,#+32
   \       0x28   0xBD70             POP      {R4-R6,PC}       ;; return
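
For reference, here is the C source of F2, reassembled from the interleaved lines of the listing above. The const locals are what force every load to complete before the first store:

void F2_Time_To_Buffer(uint8_t * p_buffer)
{
    Time_t date_time;
    (void)Get_RTC_Time(&date_time); // Discard return code
    const uint8_t year    = (uint8_t) (date_time.year - 2000U);
    const uint8_t month   = (uint8_t) date_time.month;
    const uint8_t day     = (uint8_t) date_time.day;
    const uint8_t hours   = (uint8_t) date_time.hours;
    const uint8_t minutes = (uint8_t) date_time.minutes;
    const uint8_t seconds = (uint8_t) date_time.seconds;
    p_buffer[0] = year;
    p_buffer[1] = month;
    p_buffer[2] = day;
    p_buffer[3] = hours;
    p_buffer[4] = minutes;
    p_buffer[5] = seconds;
}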

The F2 function is more efficient since it reduces the chances of data cache misses.

Question

What is the effect on efficiency of grouping the store statements as in the F2 function? (Explain how the data cache is involved.)

Notes

  1. Compiler is IAR Embedded Workbench IDE - Arm 8.42.2
  2. Compiled in debug mode with no or minimal optimizations (purposely, because it resides on a medical device platform).
  3. The Get_RTC_Time() function reads data from an RTC hardware device (a hypothetical sketch of such a reader follows these notes).
  4. The Time_t structure has uint32_t fields to model the hardware device.
  5. The buffer is uint8_t to conserve memory space (platform is memory constrained).
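
To make note 3 concrete, here is a minimal sketch of what a reader like Get_RTC_Time() could look like. The base address, register layout, and return type are assumptions for illustration, not taken from the actual platform:

/* Hypothetical RTC register block; the base address and register order are
   illustrative only, not taken from the real platform's datasheet. */
#define RTC_REGS  ((volatile uint32_t *) 0x40002800U)

int Get_RTC_Time(Time_t * p_time)   /* return type assumed */
{
    /* Each field is a 32-bit peripheral register read over the bus,
       which is why Time_t models the fields as uint32_t. */
    p_time->year    = RTC_REGS[0];
    p_time->month   = RTC_REGS[1];
    p_time->day     = RTC_REGS[2];
    p_time->hours   = RTC_REGS[3];
    p_time->minutes = RTC_REGS[4];
    p_time->seconds = RTC_REGS[5];
    return 0;                        /* 0 = success, by assumption */
}
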
  • "The F2 function is more efficient since it reduces the chances of data cache misses." Is this a statement or a question? What "data cache" are you talking about, and why would F1 induce a cache miss?
    – Botje
    Commented Nov 8, 2023 at 21:22
  • @Botje F1 alternates between reading from date_time and writing to p_buffer. Since the accesses of each block of memory are not sequential, there's more chance for one of them to be pushed out of cache.
    – Barmar
    Commented Nov 8, 2023 at 21:26
  • E.g. suppose there's only 24 bytes of cache. It can't keep both in cache at the same time.
    – Barmar
    Commented Nov 8, 2023 at 21:27
  • "more chance"? And on what is that statement based? Are there perhaps multiple threads running on this device? Is this 24 bytes hypothetical or sourced from a datasheet? Do you have documentation that proves that the output buffer will be cached in this situation? This question is impossible to answer without it.
    – Botje
    Commented Nov 8, 2023 at 21:29
  • And what do you expect me to say here? "Yes, if your hypothetical processor works in the way you claim it does, your hypothesis is correct." Not really a useful answer, I'm afraid. Find the datasheet.
    – Botje
    Commented Nov 8, 2023 at 21:48

2 Answers


Compiler is IAR Embedded Workbench IDE - Arm 8.42.2

That most likely implies an ARM MCU or MPU.

Reading from the RTC will take much more time than all of your store operations, so the processor will be waiting on the bus operations with the peripheral. Cache misses will have a marginal effect.

Also, how often are you going to read the RTC? Millions of times per second? Probably not. So this attempt to micro-optimize makes no sense at all.

Please describe the problem you have and why you think that optimizing this function can help.

  • Actually, this is called in the 1ms ISR. Commented Nov 8, 2023 at 22:34
  • I'm curious how cache misses would affect the storage into memory. Commented Nov 8, 2023 at 22:35
  • @ThomasMatthews Then it is very bad code. Do not read the RTC in the 1 ms interrupt; it makes no sense at all. Commented Nov 8, 2023 at 23:55
  • @ThomasMatthews In general, you should never do any copy/data shoveling inside an ISR. The background program should already have the data ready in the expected uint8_t format, and then the ISR only needs to swap a pointer to that buffer rather than waste time hard-copying data (see the sketch after these comments).
    – Lundin
    Commented Nov 9, 2023 at 10:22
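
Following up on the pointer-swap suggestion above, here is a minimal sketch of that approach. The buffer names, the background function, and the ISR name are hypothetical, and it assumes a single aligned pointer write is atomic on this class of ARM core:

/* Two formatted 6-byte buffers; the background code fills one while the ISR reads the other. */
static uint8_t g_time_buffers[2][6];
static uint8_t * volatile g_ready_buffer = g_time_buffers[0];

/* Background task (not the ISR): does the slow RTC read and the formatting. */
void Background_Update_Time(void)
{
    static uint8_t next = 1U;
    F2_Time_To_Buffer(g_time_buffers[next]);  /* slow peripheral access happens here */
    g_ready_buffer = g_time_buffers[next];    /* publish with a single pointer write */
    next ^= 1U;
}

/* Hypothetical 1 ms ISR: consumes already-formatted bytes instead of copying RTC data. */
void Timer_1ms_ISR(void)
{
    const uint8_t * const p_time = g_ready_buffer;  /* pointer swap instead of a hard copy */
    /* ... use p_time[0..5] here ... */
    (void)p_time;
}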

You didn't specify the target platform, but on many ARM cores caching would be a total non-factor. I would think that on many cores, the optimal way of packing the data would be something like:

add r1,sp,#4
ldmia r1,{r2,r3,r4,r5,r6,r7}  ; Perform all six loads at once
adds r2,r2,#48
strb r2,[r0,#0]
strb r3,[r0,#1]
strb r4,[r0,#2]
strb r5,[r0,#3]
strb r6,[r0,#4]
strb r7,[r0,#5]

Perhaps the compiler grouped the loads and stores to increase the likelihood of finding transforms like the above, but for whatever reason didn't happen to notice the possibility of substituting LDM. Alternatively, some compiler configurations might refrain from transforming sequences of loads into LDM instructions, since some cores can process LDR instructions with unaligned addresses but cannot process LDM instructions likewise. I wouldn't expect common compiler configurations to accommodate unaligned accesses in such fashion, but perhaps the compiler was configured that way because some code in the project relies upon it.

  • The compiler is set at optimization level 0, so it's not going to emit an ldmia instruction. :-( Commented Nov 8, 2023 at 22:33
  • @ThomasMatthews: What did you mean by "optimized" version?
    – supercat
    Commented Nov 8, 2023 at 22:52
  • The F2 function is more optimized for speed, by lowering the chances of data cache misses. Commented Nov 8, 2023 at 22:59
  • @ThomasMatthews: You just specified optimization level 0, which would imply that the compiler wouldn't be expected to make much effort to optimize anything.
    – supercat
    Commented Nov 8, 2023 at 23:04
