RAK4270 Error 5 - Potential causes?

Hello,

I have been using RAK4270 to develop a product and I’m getting “ERROR 5” from time to time just a few TX messages after joining the network.

Please note I’m running FW version v3.3.0.17 and I’m using the UART2 to communicate with the module. Below is the initialization, configuration and messages.

(Module Powered up and waiting 500ms.)
(The number on the left is the uptime in seconds)
0: at+version
0: OK RAK4270 v3.3.0.17

0: at+set_config=lora:app_key:(secret)
0: OK
0: at+set_config=lora:app_eui:(secret)
0: OK
0: at+set_config=lora:dev_eui:(secret)
0: OK
0: at+set_config=lora:adr:0
0: OK
0: at+set_config=lora:tx_power:0
0: OK
0: at+set_config=lora:dr:0
0: OK
0: at+set_config=lora:class:2
0: OK
0: at+set_config=lora:confirm:0
0: OK
0: at+set_config=lora:region:AU915
0: OK
0: at+set_config=lora:ch_mask:0:0
0: OK
0: at+set_config=lora:ch_mask:1:0
0: OK
0: at+set_config=lora:ch_mask:2:0
0: OK

(...)

3: at+set_config=lora:ch_mask:39:0
3: OK
3: at+set_config=lora:ch_mask:40:1
3: OK
3: at+set_config=lora:ch_mask:41:1
3: OK
3: at+set_config=lora:ch_mask:42:1
3: OK
3: at+set_config=lora:ch_mask:43:1
3: OK
3: at+set_config=lora:ch_mask:44:1
3: OK
3: at+set_config=lora:ch_mask:45:1
3: OK
3: at+set_config=lora:ch_mask:46:1
3: OK
3: at+set_config=lora:ch_mask:47:1
3: OK
3: at+set_config=lora:ch_mask:48:0
3: OK
3: at+set_config=lora:ch_mask:49:0
3: OK
3: at+set_config=lora:ch_mask:50:0
3: OK

(...)

4: at+set_config=lora:ch_mask:68:0
4: OK
4: at+set_config=lora:ch_mask:69:1
4: OK
4: at+set_config=lora:ch_mask:70:0
4: OK
4: at+set_config=lora:ch_mask:71:0
4: OK

4: at+join
11: OK Join Success
11: LoRa TX-> at+send=lora:20:800000D277
14: LoRa <-RX OK  (first message received fine by the GW)
14: LoRa <-RX at+recv=0,-108,-22,0 (GW sends back some MAC with the changes)
41: LoRa TX-> at+send=lora:10:0000000000000000002C00 (not received by the GW)
(nothing comes back)
(I keep sending commands to the module to see if it is still alive, and after a few commands it returns ERROR 5)
100: LoRa TX-> at+get_config=lora:status
101: LoRa TX-> at+get_config=lora:status
102: LoRa TX-> at+get_config=lora:status
103: LoRa TX-> at+get_config=lora:status
104: LoRa TX-> at+get_config=lora:status
105: LoRa TX-> at+get_config=lora:status
106: LoRa TX-> at+get_config=lora:status
107: LoRa TX-> at+get_config=lora:status
107: LoRa <-RX ERROR: 5
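
For anyone wanting to reproduce the configuration, the 72 per-channel ch_mask commands in the log can be generated rather than typed. This is only a sketch: it assumes the usual AU915 layout of eight 125 kHz channels per sub-band plus one matching 500 kHz channel (64 to 71), and the function name is made up for illustration. The command format is the one shown in the log.

```python
def au915_ch_mask_commands(sub_band: int) -> list[str]:
    """Build per-channel ch_mask commands enabling one AU915 sub-band (1..8)."""
    if not 1 <= sub_band <= 8:
        raise ValueError("sub_band must be in 1..8")
    first = (sub_band - 1) * 8            # first 125 kHz channel of the sub-band
    enabled = set(range(first, first + 8))
    enabled.add(64 + sub_band - 1)        # matching 500 kHz channel
    return [
        f"at+set_config=lora:ch_mask:{ch}:{1 if ch in enabled else 0}"
        for ch in range(72)               # channels 0..71, as in the log
    ]


# Sub-band 6 gives channels 40..47 plus 69, the same mask as in the log.
for line in au915_ch_mask_commands(6)[39:49]:
    print(line)
```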

The “zero” length RX message is from the LoRaWAN GW/Server sending MAC message back. The Chirpstack server normally sends 3 or 4 downlink MAC MSGs after joining to make sure the device has the correct channels.

I can’t always replicate the issue, but it seems to happen only when the signal between the GW and the module is weak, and only just after the “join”, while the MAC messages are being received. Could it be that a corrupted downlink MAC message is upsetting the module?

After a few uplinks, Chirpstack stops sending MAC downlink messages and if this happens the module stays up and happy without hanging anymore. The application will try to re-join every 24h, to make sure the connection is OK, and at this point, the error can occur again.

What are the potential causes of “ERROR 5”? The only description I can find is “There is an error when sending data through the UART port.”

After I get the “ERROR 5” once, the module does not work anymore. I can try to issue an “at+join” or another “at+send” and keep getting ERROR 5. Only power cycling the module fixes the issue.

Thanks!

I’m not familiar enough with the firmware of that module to know what that error message means, but speaking generally about LoRaWAN:

Packet contents, including even the unencrypted parts, are protected by cryptographic checksums, so a “corrupted” packet that is not rejected for checksum failure (“MIC mismatch”) would be extremely rare, and not something you’d see happening repeatedly.

Rejoining every 24 hours is a mistake. Having a deployed device rejoin is just about never appropriate and, as you’re seeing, introduces a failure opportunity. Essentially the only reasons to ever rejoin would be if the network server has lost the node’s session records (which is why the server of any “production” network should have backups and run in the cloud rather than on an easily damaged or stolen gateway) or if you were moving between commercial pay-per-use networks.

Connection integrity within the (essentially supposed to be eternal) join session is managed by ADR-related MAC exchanges and header bits.

Hi @cstratton thanks for your input. I understand your point about re-join, there are also some good approaches here too: https://lora-developers.semtech.com/documentation/tech-papers-and-guides/the-book/joining-and-rejoining.

Now, just so we don’t divert the discussion, unfortunately, the issue also happens on the first join operation, which can occur in many cases when the product is deployed/powered on.

Let’s see if anyone familiar with the FW source code can help clarify why the module is presenting the ERROR 5 and needs to be restarted when the situation occurs.

More Info
I did additional tests and noticed that sometimes the module doesn’t even complete the join. The at+join command returned nothing, even after waiting over 300 seconds. After that, I sent other commands, like at+version, and at about the 8th or 9th command ERROR 5 was returned.

In the situation above I could confirm the join was received and accepted by the LoRaWAN server, but it seems that something went wrong inside the module, I presume when the “accept join” downlink message was received by the module, triggering the issue. The Join Request msg RSSI was -121 according to the LoRaWAN server, so a very marginal link.

Another observation is that the issue has already been detected in 3 out of the 5 units I’m using for testing, so it is very unlikely to be a damaged module. To overcome the issue I currently have a check for “ERROR 5” in the product FW which triggers a module Power Cycle → Re-configure → Re-Join, but this is only a workaround and not a solution.
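
The workaround described above could look roughly like the sketch below. It is not the actual product FW: `send_cmd` and `recover` are hypothetical callables standing in for the UART transaction and for the Power Cycle → Re-configure → Re-Join sequence, and the "ERROR: 5" string is taken from the log in the first post.

```python
from typing import Callable, Optional


def send_with_recovery(
    command: str,
    send_cmd: Callable[[str], Optional[str]],  # hypothetical: returns the reply, or None on timeout
    recover: Callable[[], None],               # hypothetical: power cycle, re-configure, re-join
    max_recoveries: int = 1,
) -> Optional[str]:
    """Send a command; if the module answers "ERROR: 5", recover and retry."""
    reply = None
    for attempt in range(max_recoveries + 1):
        reply = send_cmd(command)
        if reply is None or "ERROR: 5" not in reply:
            return reply                       # normal reply or timeout: hand back to caller
        if attempt < max_recoveries:
            recover()                          # module is stuck, bring it back up
    return reply                               # still ERROR 5 after all recoveries
```

Since the hang first shows up as a missing reply, a real implementation would probably also want to treat a timeout (`None` here) as a trigger for recovery; that is left to the caller in this sketch.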

Many thanks!
Mike

The recommendations on that page of (at least occasional) ADR or an explicit linkcheckrq are a lot more sound than joining again. Unfortunately they didn’t really get into the consequences of rejoining.

Leaving that aside, the documentation for the RAK4270 says of this error that “There is an error when sending data through the UART port.”

Usually sending doesn’t cause an error, so it may be something else getting misidentified.

Given you kept firing queries at it (with how much time in between?), one possibility I wonder about is some sort of overflow, possibly subsequent to some other failure.

Are you using US915? If so, the 11 bytes you’re trying to send would be the maximum permissible payload length for SF10 / DR0 - if the fopts field were empty.

But given this is right at the start of the session and you mentioned MAC traffic ongoing, there’s a fair chance that fopts have some MAC responses, meaning the maximum payload length is momentarily shorter.

I could easily imagine your packet getting through an “early” length check but failing some later one deeper down after fopts are added.
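
To put numbers on that: the room left for application data is the no-fopts maximum minus whatever the stack has queued in fopts. A minimal sketch, using the 11-byte figure quoted above as an input rather than as a fact about this particular device or region (the function names are just for illustration):

```python
MAX_FOPTS_LEN = 15  # FOpts can carry at most 15 bytes of MAC commands


def remaining_payload(max_payload_no_fopts: int, fopts_len: int) -> int:
    """Bytes left for application data once fopts_len bytes of FOpts are present."""
    if not 0 <= fopts_len <= MAX_FOPTS_LEN:
        raise ValueError("fopts_len must be in 0..15")
    return max(0, max_payload_no_fopts - fopts_len)


def fits(payload_hex: str, max_payload_no_fopts: int, fopts_len: int) -> bool:
    """True if the hex-encoded payload still fits with the given FOpts length."""
    payload_len = len(bytes.fromhex(payload_hex))
    return payload_len <= remaining_payload(max_payload_no_fopts, fopts_len)


payload = "0000000000000000002C00"   # second uplink from the log
for fopts_len in (0, 2, 4):          # example FOpts sizes, not measured values
    print(fopts_len, remaining_payload(11, fopts_len), fits(payload, 11, fopts_len))
```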

It might be interesting to see what happens if you send a 1-byte dummy payload a few times after joining.

But of course that may have nothing to do with it.

All the configuration details are in the dump on the first post: at+set_config=lora:region:AU915.

Regarding the payload size, that might not matter much as the error sometimes is observed before the at+join is even completed.

Difficult to guess what might be causing it without looking at the FW code behind :face_with_monocle:

Hello @carlrowan / @beegee, would either of you be able to shed some light on the issue I have been experiencing? Please let me know if there’s any additional information I can provide.

Getting close to large volume production but need to sort this one first :slight_smile:

Regards,
Mike M.

Hi @Wisen ,

I saw a support ticket request from you. I will support you here in the forum for the benefit of others too.

If you experience error 5, it can be related to reaching the UART buffer limit of 256 bytes. I will get some more info from the SW team on what other possible scenarios there could be.

How do you transmit the AT commands to the module? Is the host a PC with an automation script driving the UART, or do you have an MCU as host? Can you confirm that there is no issue with the command termination (\r\n)? Were you able to test it on UART1?

Hi @carlrowan , thanks for that.

The issue occurs both with the module in my product (MCU host) and when using the RAK Serial Tool. Please note all configuration details are in the first message. Yes, we’re using \r\n at the end of every command.

The “ERROR 5” makes sense now as it shows only after I keep trying to communicate with the module after it “hangs” by not responding to the last command. So it seems to be the symptom and not the cause. The interesting part is that the module stops responding when it’s joining at a very weak link.
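
One way to check this against a log is to add up the bytes of the commands that never got a reply and compare the running total with the 256-byte limit mentioned above. The sketch below assumes (this is not confirmed anywhere in this thread) that unanswered commands simply pile up in the buffer, each with its \r\n terminator, and that the error fires once the total reaches the limit.

```python
from itertools import accumulate
from typing import Optional


def running_bytes(commands: list[str], terminator: str = "\r\n") -> list[int]:
    """Cumulative byte count after each command, terminator included."""
    return list(accumulate(len((c + terminator).encode("ascii")) for c in commands))


def first_overflow(commands: list[str], limit: int = 256,
                   terminator: str = "\r\n") -> Optional[int]:
    """Index of the first command at which the running total reaches `limit`."""
    for index, total in enumerate(running_bytes(commands, terminator)):
        if total >= limit:
            return index
    return None


# Unanswered commands from the log in the first post.
unanswered = ["at+send=lora:10:0000000000000000002C00"] + ["at+get_config=lora:status"] * 8
print(running_bytes(unanswered))
print(first_overflow(unanswered))
```

If the totals don't line up with where ERROR 5 appears in other captures (for example the at+version case), that would point away from a plain buffer overflow.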

The only way I can replicate the issue is by having a very weak signal between the GW and the module (normally RSSI -120 or worse). Note that the error doesn’t occur every time, about 5% of the time, making it challenging to replicate.

In this weak signal situation, my perception is that the module stops responding after receiving downlink messages like JoinAccept or any other MAC message. My application is mostly uplink, so I can’t tell if the same error would occur with a standard downlink message too. Maybe something is going wrong when decoding “corrupted” downlink messages?

Finally, I haven’t experienced this issue even once when the signal between GW and module is stronger: RSSI -110 or better, for example.

Regards,
Mike M.

Hi @Wisen ,

I will try to replicate the issue by moving away from my gateway. Btw, what network server and gateway are you using?

It is a holiday in China today, but I will ask the SW team for support regarding the issue. They might get back to me next week. For now, I will try to reproduce the issue myself and look for a possible workaround.

Network server: Chirpstack
Gateways Tested: MatchX, RAK7258 and Ursalink UG85
Considering that it is the module that stops working, I wouldn’t worry much about the other side, as the module should survive even if a dodgy packet is sent to it.

To replicate the issue in our lab I removed the antennas and/or covered the device(s) with a metal shield. The sweet spot is when the at+join shows up in the network/GW logs but sometimes fails on the node with ERROR 99.

Once I start seeing this weak/marginal link I repeat the steps below until I see the module hanging:

  1. Power ON the module;
  2. Configure the Module (you can use my config)
  3. Join the network: at+join → wait OK
  4. Send a message at+send → wait OK
  5. Repeat the send 3 or 4 times

If the module does not hang it’s unlikely it will, so I restart the whole test again until I notice the module doesn’t return the OK anymore during the steps above.
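
The steps above can be automated along these lines. This is only a sketch: `power_cycle` and `send_cmd` are hypothetical hooks for whatever controls the module's supply and UART on the test bench, with `send_cmd` returning the reply or `None` when nothing comes back within the timeout.

```python
from typing import Callable, Optional, Sequence, Tuple


def run_trial(
    send_cmd: Callable[[str], Optional[str]],  # hypothetical UART transaction
    config: Sequence[str],
    payload_cmd: str,
    sends: int = 4,
) -> Optional[str]:
    """Run configure, join and a few sends; return the command that got no reply."""
    steps = list(config) + ["at+join"] + [payload_cmd] * sends
    for cmd in steps:
        if send_cmd(cmd) is None:              # no reply: module looks hung
            return cmd
    return None                                # trial completed without a hang


def repeat_until_hang(
    power_cycle: Callable[[], None],           # hypothetical bench power control
    send_cmd: Callable[[str], Optional[str]],
    config: Sequence[str],
    payload_cmd: str,
    max_trials: int = 100,
) -> Optional[Tuple[int, str]]:
    """Repeat trials from a cold start; return (trial number, stuck command) or None."""
    for trial in range(1, max_trials + 1):
        power_cycle()
        stuck = run_trial(send_cmd, config, payload_cmd)
        if stuck is not None:
            return trial, stuck
    return None
```

Logging which command was the first to go unanswered (at+join versus one of the sends) for each hang should help narrow down which downlink is involved.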

Hi @carlrowan, just checking if you have any update on this?
Thanks, Mike M.