Hardfault handler for embedded systems

Hardfaults are among the most feared bugs in embedded systems, especially when they occur in a client's home.
Yet, in my experience, a great number of the embedded products that exist today fail to handle them adequately.
Most of the hardfault handlers I have encountered do one of three things: in the worst case they enter an infinite loop, in a better case they reboot the system, and in the best case they save the core registers and then reboot the system.
Unfortunately, in the real world, those help us very little in debugging the cause of the hardfault.
What we truly need is to save both the core registers and the stack of the violating context, similar to how a segmentation fault is handled in Linux.
With this data, we can reconstruct the callstack that led to the hardfault and find the root cause of it.

Here are general guidelines for building a hardfault handler.
I will also include a link to an example implementation in the comments.

1. Hardfault IRQ
Obviously, the first step is to implement the hardfault IRQ handler.
This is a naked function, usually written in assembly. It is implemented according to the manufacturer's instructions and usually comes with the SDK.
For ARM, a good example can be found in the FreeRTOS documentation.

2. Reserve a dedicated area in non-volatile memory
Reserve a RAM segment (in the linker script) or a page of the internal flash to store the saved data.
If you use RAM, make sure the startup code does not zero or initialize that segment, otherwise the data will not survive the reset.
Using the RAM approach may be a little safer in our already compromised system.
Using flash conserves precious RAM and allows a hard reset if needed.
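
As an illustration, here is a minimal sketch of what the reserved area might hold if you go the RAM route. It assumes a GCC-style toolchain and a linker-script section called ".noinit" that the startup code leaves alone; the section name and the names crash_record_t, CRASH_MAGIC, CRASH_STACK_MAX and g_crash are all hypothetical. The magic value is there so the boot code can tell a real dump from random power-on RAM contents.

```c
#include <stdint.h>
#include <string.h>

#define CRASH_MAGIC      0xDEADFA17u  /* hypothetical marker: "a dump is present" */
#define CRASH_STACK_MAX  256u         /* bytes of stack to keep; tune to your RAM budget */

/* Assumption: the linker script defines a ".noinit" output section in RAM
 * that the startup code neither zeroes nor initializes. On a host build the
 * attribute is dropped so the sketch still compiles. */
#if defined(__arm__) && defined(__GNUC__)
#define CRASH_SECTION __attribute__((section(".noinit")))
#else
#define CRASH_SECTION
#endif

typedef struct {
    uint32_t magic;                   /* CRASH_MAGIC when the record is valid */
    uint32_t regs[8];                 /* r0-r3, r12, lr, pc, xpsr */
    uint32_t stack_len;               /* number of valid bytes in stack[] */
    uint8_t  stack[CRASH_STACK_MAX];  /* copy of the faulting context's stack */
} crash_record_t;

CRASH_SECTION crash_record_t g_crash;

/* Returns 1 if the area holds a plausible dump written by the handler. */
int crash_record_is_valid(const crash_record_t *rec)
{
    return rec->magic == CRASH_MAGIC && rec->stack_len <= CRASH_STACK_MAX;
}

/* Invalidates the record once it has been reported. */
void crash_record_clear(crash_record_t *rec)
{
    memset(rec, 0, sizeof *rec);
}
```

After a reboot, the boot code would call crash_record_is_valid(&g_crash), report the record if it is valid, and then call crash_record_clear(&g_crash) so the same dump is not reported twice. A checksum over the record would make the validity check stronger; I left it out to keep the sketch short.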

3. Copy the registers
Copy the MCU's core registers, including any additional "interesting" registers (e.g., SCB registers in Cortex-M MCUs), to the reserved memory area.
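
For Cortex-M, this step could look roughly like the sketch below. On exception entry the core itself pushes r0-r3, r12, lr, pc and xpsr onto the active stack, so the C part of the handler only needs a pointer to that frame (passed in by the assembly stub). The type and function names here are hypothetical, and the fault status values are passed in as a plain struct so the sketch also compiles off-target; on a real device you would fill it from the CMSIS fields SCB->CFSR, SCB->HFSR, SCB->MMFAR and SCB->BFAR.

```c
#include <stdint.h>

/* Order in which a Cortex-M core stacks registers on exception entry. */
enum { REG_R0, REG_R1, REG_R2, REG_R3, REG_R12, REG_LR, REG_PC, REG_XPSR, REG_COUNT };

/* Fault status registers of the System Control Block (hypothetical wrapper). */
typedef struct {
    uint32_t cfsr;   /* Configurable Fault Status Register */
    uint32_t hfsr;   /* HardFault Status Register */
    uint32_t mmfar;  /* MemManage Fault Address Register */
    uint32_t bfar;   /* BusFault Address Register */
} fault_status_t;

typedef struct {
    uint32_t       core[REG_COUNT];  /* registers stacked by the hardware */
    fault_status_t fault;            /* the "interesting" SCB registers */
} saved_regs_t;

/* Copies the hardware-stacked frame and the fault status into the record.
 * frame points at the first stacked word (r0). */
void save_registers(saved_regs_t *out, const uint32_t *frame, const fault_status_t *fs)
{
    for (int i = 0; i < REG_COUNT; i++) {
        out->core[i] = frame[i];
    }
    out->fault = *fs;
}
```

Note that r4-r11 are not stacked by the hardware. If you want them too, the assembly stub has to push them before calling into C.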

4. Identify the context of the violation
We need to find the start address of the stack of the violating context.
The violation could occur in the main context (interrupts, main function, idle task) or a process context (OS task).
You need to refer to the documentation of the MCU in order to see if you need the main context's stack or a process context's stack.
On ARM Cortex-M MCUs, the EXC_RETURN value that the core places in the LR register on exception entry reveals this information (see the ARMv7-M Architecture Reference Manual).
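
To make the LR check concrete: on Cortex-M, bit 2 of the EXC_RETURN value in LR tells you which stack pointer was in use when the fault hit (0 for the main stack pointer MSP, 1 for the process stack pointer PSP). A minimal sketch, with hypothetical names, assuming the assembly stub hands the LR value to C:

```c
#include <stdint.h>

typedef enum { STACK_MAIN, STACK_PROCESS } fault_stack_t;

/* Bit 2 of EXC_RETURN: 0 = frame was pushed on MSP, 1 = frame was pushed on PSP. */
#define EXC_RETURN_SPSEL (1u << 2)

/* Decides which stack the faulting context was using.
 * exc_return is the LR value captured at the very start of the handler. */
fault_stack_t faulting_stack(uint32_t exc_return)
{
    return (exc_return & EXC_RETURN_SPSEL) ? STACK_PROCESS : STACK_MAIN;
}
```

For example, 0xFFFFFFFD (thread mode on PSP, the typical value for an RTOS task) gives STACK_PROCESS, while 0xFFFFFFF9 (thread mode on MSP) and 0xFFFFFFF1 (fault inside another handler) give STACK_MAIN.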

The start of the main context stack is usually defined in the linker script.
The stacks of the tasks are known to us because, well, we created the tasks.

It is also possible to save only the callstack rather than the entire stack. For this you'll need an external library such as backtrace.

5. Copy the stack
The SP register value should point to the end of the stack (note that here we refer to the SP at the time of the violation, the value we copied in step 3).
The beginning of the stack is what we found in the previous step.
Copy the memory between the SP register and the beginning of the stack to the reserved memory.
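
A minimal sketch of the copy, assuming a descending stack (as on Cortex-M) and a fixed-size destination buffer. The function name and parameters are hypothetical. It also checks that SP actually lies inside the stack we expect, because a corrupted SP is a common cause of hardfaults and reading through it could fault again inside the handler.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Copies the used part of a descending stack, i.e. the bytes in [sp, stack_start).
 *   dst, cap     destination buffer in the reserved area and its size in bytes
 *   sp           stack pointer of the faulting context
 *   stack_limit  lowest valid address of that stack
 *   stack_start  address just past the highest stack byte (initial SP)
 * Returns the number of bytes copied; 0 if sp is outside the stack. */
size_t copy_stack(uint8_t *dst, size_t cap,
                  uintptr_t sp, uintptr_t stack_limit, uintptr_t stack_start)
{
    if (sp < stack_limit || sp >= stack_start) {
        return 0;  /* SP is corrupt or the stack is empty: copy nothing */
    }
    size_t len = (size_t)(stack_start - sp);
    if (len > cap) {
        len = cap;  /* keep the bytes nearest to SP, i.e. the most recent frames */
    }
    memcpy(dst, (const void *)sp, len);
    return len;
}
```

When the used stack is larger than the buffer, this keeps the innermost frames and drops the outermost ones, which is usually the more useful half for finding the faulting call.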

6. Reset the MCU

7. Retrieve the data from the non-volatile memory
After system reboot, when no longer in the hardfault state, retrieve the saved data from the non-volatile memory area.
This data can now be safely logged or sent to a remote server for analysis.

8. Analyzing the data
This can be done with tools such as addr2line, which map code addresses back to source lines, in order to build the callstack of the violation.
Another, much less sophisticated option is simply to scan the stack dump by eye for values that look like flash addresses, and then look them up in the disassembly window of the IDE.
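
The "scan for flash addresses" approach is easy to automate. Below is a minimal sketch that assumes a Cortex-M target, where return addresses saved on the stack have bit 0 set (the Thumb bit), and a hypothetical flash range of 0x08000000-0x080FFFFF; take the real range from your linker script. All names are made up for the example.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical flash range; use the values from your linker script. */
#define FLASH_BASE_ADDR 0x08000000u
#define FLASH_END_ADDR  0x08100000u  /* first address past the flash */

/* A word is a candidate return address if it lies in flash and has
 * bit 0 set (Thumb state marker on Cortex-M). */
int looks_like_return_address(uint32_t w)
{
    return (w & 1u) && w >= FLASH_BASE_ADDR && w < FLASH_END_ADDR;
}

/* Scans a stack dump and collects candidate code addresses, in stack order.
 * Returns how many were written to out (at most out_cap). */
size_t find_code_addresses(const uint32_t *dump, size_t n_words,
                           uint32_t *out, size_t out_cap)
{
    size_t found = 0;
    for (size_t i = 0; i < n_words && found < out_cap; i++) {
        if (looks_like_return_address(dump[i])) {
            out[found++] = dump[i] & ~1u;  /* clear the Thumb bit before symbol lookup */
        }
    }
    return found;
}
```

Each address found can then be passed to something like "addr2line -f -e firmware.elf 0x08000122" (the ELF file name and address are placeholders). Not every hit is a real return address, since stale stack contents and stored function pointers match too, so treat the result as a list of suspects rather than an exact callstack.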

Please share any remarks or improvements you have in the comments.
I especially welcome comments about how to analyze the data or fetch the callstack.

#embedded #software #developer