RAK3172 gets bricked if restarted while config is changing

Hello Support,

I have been trying exhaustively to resolve an issue, and I think I may have just found the cause: a corrupted configuration saved in the RAK3172 Flash.

Background
I have developed a product based on the RAK3172 and we currently have around 30 units running in a Pilot before we can approve mass production. Those units are going through some tests and, after some time in the field, the RAK3172 module on a few of them stopped working completely.

The symptom is that the RAK3172 module simply doesn’t print anything on the Serial port or respond to any AT command. I also verified that those units were not in bootloader mode. The RAK3172 power consumption is around 7mA, so it’s definitely not in Stop mode.

At this stage, the RAK3172 is basically bricked and the only way I found to recover the unit was to Fully Erase and reflash the firmware using ST-Link (SWIO). Once this is done the bricked RAK3172 gets back to life and works normally.

Issue Investigation
Investigating this problem further in the Lab, today I ran a power cycle test, basically turning the device ON/OFF multiple times at random intervals… and bingo! I was able to “brick” the RAK3172 module again.

Now with a method to “replicate” the potential issue, I did the following:

  1. Flash the RAK3172 with the latest firmware (3.4.11) over SWIO
  2. Read all the RAK3172 Flash memory back via SWIO and store it in a HEX file
  3. Run my power cycle test until I have the RAK3172 bricked again
  4. Re-read all the RAK3172 Flash memory via SWIO again and store it in another HEX file
  5. Compare both HEX files to look for any difference
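
The comparison in step 5 can be scripted. Below is a minimal sketch, assuming the dumps are standard Intel HEX files; the function names (`parse_intel_hex`, `diff_dumps`, `make_record`) are made up for this illustration and it is not the tool that produced the screenshots in this post.

```python
def parse_intel_hex(text):
    """Parse Intel HEX text into a {absolute_address: byte_value} dict."""
    memory = {}
    base = 0  # upper address bits, set by record types 0x02 / 0x04
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith(":"):
            continue  # skip blank or non-record lines
        raw = bytes.fromhex(line[1:])
        if sum(raw) & 0xFF != 0:
            raise ValueError(f"bad checksum in record: {line}")
        count = raw[0]
        offset = int.from_bytes(raw[1:3], "big")
        rectype = raw[3]
        data = raw[4:4 + count]
        if rectype == 0x00:  # data record
            for i, value in enumerate(data):
                memory[base + offset + i] = value
        elif rectype == 0x01:  # end of file
            break
        elif rectype == 0x02:  # extended segment address
            base = int.from_bytes(data, "big") << 4
        elif rectype == 0x04:  # extended linear address
            base = int.from_bytes(data, "big") << 16
    return memory


def diff_dumps(before, after):
    """List (address, old, new) for every byte that differs; None = not present."""
    addresses = sorted(set(before) | set(after))
    return [(a, before.get(a), after.get(a))
            for a in addresses if before.get(a) != after.get(a)]


def make_record(rectype, offset, data=b""):
    """Build one Intel HEX record (only used here to create sample input)."""
    body = bytes([len(data)]) + offset.to_bytes(2, "big") + bytes([rectype]) + data
    checksum = (-sum(body)) & 0xFF
    return ":" + (body + bytes([checksum])).hex().upper()
```

To use it, read both files as text, pass each through `parse_intel_hex`, and print the tuples from `diff_dumps` with the addresses formatted as hex, so that changes in the region of interest are easy to spot.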

What I could see is that the “bricked” RAK3172 module had some different data at the end of the Flash memory, which I guess is the Flash area reserved for the module Configuration like EUI, APP KEY, etc.

The Configuration Flash area seems to start at address 0x0803F000 in the RAK3172 STM32 memory.

I have the 3 HEX dumps but it doesn’t seem to be possible to attach them here.

  • One from the unit after a Full Erase + Flash with FW v3.4.11
  • One from the unit after it was powered up, configured and running
  • One from the unit after the restart test, once it got bricked

In this case, I’m attaching the DIFF screenshots but I’m happy to share the HEX files if you need them for testing/investigation.


Proposed Solution
My guess for the cause of the issue is that the STM32 on the RAK3172 module gets restarted at the moment a Flash save operation is in progress, leaving the stored data incomplete and no longer valid. This invalid configuration causes the RAK3172 Firmware to get stuck.

If that’s the case, I would recommend implementing some form of Flash Configuration CRC/verification that the RAK3172 Firmware can validate. If the CRC is invalid, the Firmware simply resets the configuration to keep the unit alive. A more elegant solution would have 2 Flash Configuration areas, where the CRC is validated before switching between the two areas.
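
To make the two-area idea concrete, here is a minimal sketch that simulates it in RAM with plain Python. Everything in it is an assumption made for illustration (the slot size, the header layout, the names `encode_slot`, `decode_slot`, `load_config`, `save_config`); it is not how RUI3 stores its configuration, and a real implementation would also need to handle Flash page erase and sequence number wrap-around.

```python
import struct
import zlib

SLOT_SIZE = 64                    # hypothetical size of one config area
HEADER = struct.Struct("<II")     # sequence number, payload length
ERASED = b"\xff" * SLOT_SIZE      # what an erased Flash area reads back as


def encode_slot(seq, payload):
    """Serialize header + payload + CRC32, padded with 0xFF like erased Flash."""
    body = HEADER.pack(seq, len(payload)) + payload
    record = body + struct.pack("<I", zlib.crc32(body))
    if len(record) > SLOT_SIZE:
        raise ValueError("payload too large for slot")
    return record.ljust(SLOT_SIZE, b"\xff")


def decode_slot(raw):
    """Return (seq, payload) if the slot's CRC checks out, otherwise None."""
    seq, length = HEADER.unpack_from(raw)
    end = HEADER.size + length
    if end + 4 > len(raw):
        return None               # erased slot or corrupted length field
    (stored_crc,) = struct.unpack_from("<I", raw, end)
    if zlib.crc32(raw[:end]) != stored_crc:
        return None               # torn or corrupted write
    return seq, raw[HEADER.size:end]


def load_config(slots, defaults):
    """Use the newest valid slot; fall back to defaults so boot never gets stuck."""
    valid = [d for d in (decode_slot(s) for s in slots) if d is not None]
    if not valid:
        return defaults
    return max(valid, key=lambda d: d[0])[1]


def save_config(slots, payload):
    """Write into the slot that does NOT hold the newest valid copy."""
    valid = [(d[0], i) for i, d in enumerate(decode_slot(s) for s in slots)
             if d is not None]
    if valid:
        seq, newest = max(valid)
        target, next_seq = 1 - newest, seq + 1
    else:
        target, next_seq = 0, 1
    slots[target] = encode_slot(next_seq, payload)
```

The point of the design is that a reset in the middle of `save_config` can only damage the slot being written; the previous copy still passes its CRC and is used on the next boot, and if both copies are invalid the firmware falls back to defaults instead of hanging.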

Workaround
Unfortunately, I can’t think of a workaround for this problem. Although it is rare, the final product might be restarted at any time. If that coincides with the moment the RAK3172 is saving the configuration, the unit is bricked and there’s no way to recover it in the field.

Please let me know if there’s something I can do to prevent this from occurring.

Thanks and Regards,
Mike M.

Thanks for sharing your findings @Wisen. This will help us fix the issue and improve the implementation. This concern has already been raised with the RUI3 team and we will make improvements asap.

Hello @Wisen, could you elaborate on which Flash save operations you are referring to?
Are you saying that within your custom Firmware you are saving to Flash at some point and during that process the RAK3172 has a Brownout Reset? And that possibly is causing corruption?

Regards

Hi @pmjackson, thanks for your questions. They are valid ones and I should clarify my use case.

I’m using the stock RAK3172 Firmware RUI3 Version 3.4.11, currently the latest one available. I’m using only AT Commands to interact with the module.

The mentioned operations that will trigger the Flash to be written are AT Commands for configuration changes that must be persisted in RAK3172’s memory like: AT+DEVEUI=1122334455667788, AT+APPKEY=01020AFBA1CD4D20010230405A6B7F88, etc

Any AT Command that internally causes the RAK3172 Firmware to write to its own Flash for persistence should fall under this same scenario.

In my specific product, we have a main MCU that interacts with the RAK3172 over Serial using AT Commands. In our case, we do configure, for example, AT+DEVEUI=XXXXXXXXXXXXXX at the start-up of our application.

If a Brownout Reset occurs during those AT Commands used to configure the RAK3172 module, there’s a chance of the module’s Firmware getting stuck (and bricked) with bad/corrupted/incomplete data stored in RAK3172’s Flash memory.

Unfortunately, this is not as uncommon as it might sound, especially when users are replacing batteries or powering the unit up and down unexpectedly.
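
One way the host MCU could at least narrow the window is to query a parameter first and send the setting command only when the stored value differs. The sketch below shows that logic in Python for readability. It rests on unverified assumptions: that the `AT+DEVEUI=?` query replies with a line of the form `AT+DEVEUI=<value>`, and that a query does not itself trigger a Flash write. `send_command` is a hypothetical callable standing in for the serial link, and `FakeModule` exists only to exercise the sketch.

```python
def set_if_changed(send_command, name, wanted):
    """Send AT+<name>=<wanted> only if the module reports a different value.

    send_command is a hypothetical callable: it takes a command string and
    returns the reply as a list of lines. Returns True if a write was sent.
    """
    prefix = f"AT+{name}="
    current = None
    for line in send_command(prefix + "?"):
        line = line.strip()
        if line.upper().startswith(prefix):
            current = line[len(prefix):]
    if current is not None and current.upper() == wanted.upper():
        return False              # already stored, skip the write
    send_command(prefix + wanted)
    return True


class FakeModule:
    """Stand-in for the module, assuming the reply format described above."""

    def __init__(self, deveui):
        self.deveui = deveui
        self.writes = 0           # counts setting commands received

    def __call__(self, command):
        if command == "AT+DEVEUI=?":
            return [f"AT+DEVEUI={self.deveui}", "OK"]
        if command.startswith("AT+DEVEUI="):
            self.deveui = command.split("=", 1)[1]
            self.writes += 1
            return ["OK"]
        return ["AT_ERROR"]
```

With this, a start-up that finds the DevEUI already configured sends no setting command at all. It does not remove the risk during a genuine configuration change, so it is a mitigation on the host side rather than a fix.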

Regards

@Wisen ,

Thank you for the extra explanation.

Regards,

Hi all,

Looks like I’ve also run into the same thing: the RAK3172 gets bricked and doesn’t respond.
I can recover it through the STM32 bootloader, but only when I do a full chip erase, not just an erase of the sectors occupied by the firmware, which is the default (this adds confidence that the bug is related to corrupted parameters).

In contrast with the situation above, I didn’t turn off power while changing the configuration; I just played with double resetting the module. When I tried this sequence on the RST pin: 0, 100ms, 1, 100ms, 0, 100ms, 1, 100ms, the module died.
So from my point of view, it looks like the firmware is writing something to flash on boot, even without any user configuration. If so, that sounds like a bug in itself!
And of course, powering off while parameters are being saved must not brick the module. Yes, the parameters could potentially be reset to defaults in this case, because writing to the chip’s flash requires erasing the whole sector first; but the module must keep working. Please at least check the parameter values or a CRC over the whole parameter flash area, and/or implement the elegant solution with two flash areas, as @Wisen kindly suggested.

Hi @lukegluke ,

The device getting bricked is now fixed in RUI3 v4.0.0, which will be released soon (still under testing). The flash tasks have been improved to solve this issue.

Hi @carlrowan ,

Nice to hear, looking forward to the new release, thanks.

Hello @lukegluke.

Question: are you running an Arduino custom firmware on the RAK3172?
If so, could you add “api.system.sleep.all(1000);” at the beginning of Setup and see if it makes a difference?
It seems to help greatly for my custom firmware.
Without it, I can brick the device with random Resets.
With it, I cannot brick it even after many, many attempts.
I am still using RUI3.5.3.

Regards

Hello @pmjackson

No, I’m not using custom firmware, just the stock RUI3 Version 3.4.11 firmware with AT commands from an external MCU.