RAK3172 randomly losing connection to gateway

chansheunglong · August 23, 2022, 5:25pm

I have been playing with the LoRaWAN_OTAA example code, and everything works fine. But sometimes, without any obvious reason, api.lorawan.send starts to return false, indicating that the message was not queued for TX, and it keeps returning false from then on. The send callback is not triggered, which means no messages are being sent. The issue appears randomly: sometimes I get 2000+ successful uplinks before it occurs, and sometimes it appears shortly after the OTAA join. The only way to recover is to reboot the device, which I don’t think is an acceptable solution. Has anyone experienced a similar issue, or can staff from RAK give us some ideas?

We have checked the gateway and confirm no message is sent from the device, and the gateway is fully functional as another node with SX1276 is working normally, for months now. We can confirm the issue is within the RAK3172 itself.

Also, is there any way to enable more detailed debug output? We previously worked with MCCI-LMIC, and the amount of debug output available there was very helpful for debugging issues and knowing what’s going on.

We are planning to replace old-school AVR+SX1276 combo with RAK3172, so we have to solve the issue.

Attached are small pieces of code around api.lorawan.send that we use

while (1) {
  // Read sensor, put data into buffer, etc.
  if (api.lorawan.send(txbuffer_size, txbuffer, 8, false, 0)) {
    txstat.uplink_counter++;
  } else {
    // Sometimes, without reason, api.lorawan.send
    // starts to return false and does not recover
    txstat.fail++;
  }
  api.system.sleep.cpu(10000); // Wait for TX to complete
  printStatistic();            // Show statistics
  api.system.sleep.cpu(30000); // Wait for next interval
}

beegee · August 24, 2022, 2:38am

@chansheunglong

I don’t see that problem. I have had a RAK3172 running in my living room for two months (battery test) and it has never failed.

What is your payload size?
Do you have ADR enabled or disabled?
What LoRaWAN region are you using?

chansheunglong · August 24, 2022, 5:22am

@beegee

I am on AS923, the payload size is 16 bytes, ADR is enabled and the data rate is SF7. I am using OTAA, connected to TTN through RAK’s gateway. I send 5 unconfirmed uplinks, then 1 confirmed uplink, and repeat.

Is there any info on the reasons for api.lorawan.send to return false? Is it related to the payload and/or the config?

Thanks for the support.

beegee · August 24, 2022, 6:36am

How often do you send packets?
An unconfirmed LoRaWAN packet send cycle can take 6-7 seconds (including waiting for the end of the RX windows).

chansheunglong · August 24, 2022, 7:43am

I am sending every ~40 seconds. After calling api.lorawan.send, I wait 10 seconds using api.system.sleep.all for the whole TX process to complete, then another 30 seconds for the next interval. For each call to api.lorawan.send that returns true, I can see that the sendCallback() function is triggered. When api.lorawan.send returns false, sendCallback() is not triggered.

I have the duty cycle limiter disabled

beegee · August 24, 2022, 7:55am

I am using AS923-3 which is basically the same as AS923.
I will run a test.

Is this correct:

ADR enabled
Duty Cycle disabled
Sending every 40 seconds.

Datarate with SF7 can be DR5 or DR6, which one do you set initially?

chansheunglong · August 24, 2022, 8:17am

I have tried a few different delays after api.lorawan.send:

/*1*/ api.system.sleep.cpu(10000); //Sleep for 10 seconds
/*2*/ api.system.sleep.cpu((api.lorawan.rx1dl.get() + api.lorawan.rx2dl.get()) + 1000); //Sleep for 5+2+1=8 seconds
/*3*/ for (uint8_t i = 0; i < 10; i++) api.system.sleep.cpu(1000); //Sleep for 10*1000ms = 10 seconds

The chance of api.lorawan.send randomly returning false is higher with (2) than with (1).

(3) simply does not work: the first call to api.lorawan.send returns true, but the next call and all calls after it return false. That is surprising, as we thought RUI would ensure TX is completed before allowing the sleep to actually happen.

With (2), the error is the same as with (3) but happens much later, roughly 100 uplinks in (not an exact number).

At this moment, I can see that when api.lorawan.send returns false, service_lora_send() in service_lora.c returns the error -UDRV_BUSY.
My best guess is that the LoRa driver gets stuck at some stage and cannot be reset or released.

My initial configuration is as follows:

  //AS923 Region
  api.lorawan.band.set(RAK_REGION_AS923);

  //Join using OTAA Mode
  api.lorawan.njm.set(RAK_LORA_OTAA);

  //https://docs.rakwireless.com/RUI3/LoRaWAN/#deviceclass
  api.lorawan.deviceClass.set(RAK_LORA_CLASS_A); //Class A

  //https://docs.rakwireless.com/RUI3/LoRaWAN/#rety
  api.lorawan.rety.set(0); //No retry if fail to get confirm

  //https://docs.rakwireless.com/RUI3/LoRaWAN/#cfm
  api.lorawan.cfm.set(false); //No confirmed uplink

  //https://docs.rakwireless.com/RUI3/LoRaWAN/#adr
  api.lorawan.adr.set(true); //Use ADR (adaptive data rate)

  //https://docs.rakwireless.com/RUI3/LoRaWAN/#dcs
  api.lorawan.dcs.set(false); //Ignore duty cycle limit

  //https://docs.rakwireless.com/RUI3/LoRaWAN/#dr
  api.lorawan.dr.set(5); //Use DR5

  //https://docs.rakwireless.com/RUI3/LoRaWAN/#txp
  //https://docs.rakwireless.com/RUI3/Appendix/#tx-power-by-region
  api.lorawan.txp.set(7);

Thanks for the quick support! We can only make the switch to RAK’s products if we can be sure everything works. We have old nodes that have been running for more than 2 years continuously in harsh outdoor conditions without any restart or error, and we do not want to damage our reputation with our next release.

beegee · August 24, 2022, 10:29am

Sorry, I didn’t find time to start any test today.

Instead of using the api.system.sleep.cpu() function, did you consider a completely different firmware structure? Just calling sleep and hoping that the packet is sent is not very reliable (just my opinion).

I have examples that work really well that use the LoRaWAN callbacks and API timers for the data flow.

You can find the code in my Github repo RUI3-Sensor-Node-Air-Quality

chansheunglong · August 24, 2022, 11:09am

I can confirm -UDRV_BUSY is the reason behind the sudden loss of connection. It seems to happen if the program does not give the driver enough time to finish sending the uplink (in other words, the delay/sleep duration is not long enough).

IMHO -UDRV_BUSY should be handled inside RUI itself, and the driver should recover automatically. What I observe is that once the driver enters the busy state, it stays in the busy state.

I agree with your point. As api.lorawan.send only tells us whether the uplink was queued, we have no way to know whether the uplink actually happened, which is not ideal. And without any recovery or reset option, it is not reliable.

A suggestion is to add an advanced API such as api.lorawan.send_blocking, which would only return after the TX has completed, allowing RUI to fully manage the uplink period (and also enter low power automatically). Better still, the return value should indicate whether the TX completed or not.

In our firmware, we determine if the packet is sent by incrementing a counter within the sendCallback function, so that we know if the actual TX happened or not.
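A minimal sketch of that bookkeeping, extended with a timeout so that a send whose callback never arrives can be detected. `TxMonitor` and its members are hypothetical names, not part of RUI3; on the device, `onQueued()` would be called when api.lorawan.send returns true, `onCallback()` from sendCallback(), and the timestamps would come from a millisecond clock such as millis().

```cpp
#include <cstdint>

// Hypothetical helper, not part of RUI3: compares queued uplinks with
// completed ones and flags a send whose callback is overdue.
struct TxMonitor {
    uint32_t queued = 0;        // times the send call reported "queued"
    uint32_t completed = 0;     // times the send callback fired
    uint32_t lastQueuedMs = 0;  // timestamp of the most recent queued uplink

    void onQueued(uint32_t nowMs) { ++queued; lastQueuedMs = nowMs; }
    void onCallback() { ++completed; }

    // True while at least one queued uplink has not reported completion.
    bool inFlight() const { return queued != completed; }

    // True when an uplink is in flight and its callback is overdue.
    // Unsigned subtraction keeps this correct across clock wrap-around.
    bool stalled(uint32_t nowMs, uint32_t timeoutMs) const {
        return inFlight() && (nowMs - lastQueuedMs) > timeoutMs;
    }
};
```

What to do when `stalled()` reports true is a separate decision (logging, rejoining, or rebooting); the sketch only makes the condition visible.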

I notice that your example uses a timer to send at a fixed interval. However, the implementation still calls api.lorawan.send and hopes for the best.

Take Arduino-LMIC as an example: there we can very easily determine the state of the driver. Moreover, even after a TX_FAILED in LMIC, the next uplink can still succeed, whereas api.lorawan.send never recovers once it has started failing.

croman13n3c · July 17, 2023, 12:19pm

Good day,

I encountered the same behavior. Only rebooting the device fixes the problem.

beegee · July 18, 2023, 7:28am

You can use the LoRaWAN callbacks to know when TX is finished and check the result.

/**
 * @brief Callback after TX is finished
 *
 * @param status TX status
 */
void sendCallback(int32_t status)
{
	Serial.printf("TX status %d\n", status);
}

enabled with

api.lorawan.registerSendCallback(sendCallback);
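A minimal sketch of how the callback could be used to hold back the next send until the previous one has finished. `TxGate` and `SendFn` are hypothetical names, not part of RUI3; on the device, the function passed in would wrap the api.lorawan.send call, and `onTxDone()` would be called from sendCallback().

```cpp
#include <cstdint>

// Hypothetical helper, not part of RUI3: allows one uplink in flight at a time.
using SendFn = bool (*)();  // stand-in for a wrapper around api.lorawan.send()

class TxGate {
public:
    explicit TxGate(SendFn send) : send_(send) {}

    // Queue an uplink unless the previous one is still waiting for its callback.
    bool trySend() {
        if (busy_) return false;     // previous TX/RX cycle not finished yet
        if (!send_()) return false;  // the stack refused the packet
        busy_ = true;
        return true;
    }

    // Call this from the function registered with registerSendCallback().
    void onTxDone(int32_t status) {
        lastStatus_ = status;
        busy_ = false;
    }

    bool busy() const { return busy_; }
    int32_t lastStatus() const { return lastStatus_; }

private:
    SendFn send_;
    bool busy_ = false;
    int32_t lastStatus_ = 0;
};
```

If the callback never arrives, the gate stays closed, so a timeout on the busy state would still be needed to cover that case.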

antoine · September 16, 2023, 5:57pm

I also observed this behavior a few times. Some of my devices just stop sending data. I was able to connect to the serial port to gather debug info from my device, and I can see that the call to api.lorawan.send returns false. Calls to api.lorawan.join also return false. If I reboot the device by cycling the power, everything is back to normal.

As a workaround, I am considering calling api.system.reboot() if send() returns false. Do you think this could work?

Would you have any other suggestions?
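A minimal sketch of that workaround, assuming a reboot should only follow several consecutive failures rather than a single one (a lone false can also mean a previous TX is still running). `SendWatchdog` is a hypothetical name, not part of RUI3; on the device, the caller would pass in the result of api.lorawan.send and call api.system.reboot() when `record()` returns true.

```cpp
#include <cstdint>

// Hypothetical helper, not part of RUI3: counts consecutive failed sends.
class SendWatchdog {
public:
    explicit SendWatchdog(uint8_t limit) : limit_(limit) {}

    // Record the result of one send attempt.
    // Returns true once `limit` attempts in a row have failed.
    bool record(bool queued) {
        if (queued) {
            failures_ = 0;  // any success resets the streak
            return false;
        }
        if (failures_ < limit_) ++failures_;  // saturate, never overflow
        return failures_ >= limit_;
    }

private:
    uint8_t limit_;
    uint8_t failures_ = 0;
};
```

The limit is a judgment call: too low and a timer collision triggers a reboot, too high and the device stays silent for longer than necessary.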

antoine · September 18, 2023, 12:50pm

The device that I power cycled 2 days ago stopped transmitting again. This is the same problem, calling api.lorawan.send returns false.

The device transmits an unconfirmed packet every 15 minutes, payload size 8 bytes, with SF10BW125. I have multiple identical devices that have been working for months and are still working.

I logged the LoRaWAN packets on the gateway and they are all identical.

This is likely a bug within the firmware. My code is based on firmware version RUI_4.0.5_RAK3172-E.

beegee · September 18, 2023, 1:30pm

Welcome back to the forum @antoine

Can you share your code? I have a RAK3172 with custom RUI3 firmware running here for weeks.

antoine · September 18, 2023, 2:34pm

I can’t share the code as it is because it’s proprietary, but the overall architecture is relatively simple.

In the loop, I call api.system.sleep.all(); and do some checks for digital inputs (handling of a button that is not used once deployed).

There are two timers

one to check our sensor and send an unconfirmed message every 15 minutes.
another one to send the battery level every 24 hours

Otherwise we have a recvCallback to handle downlink payload and code to save/load parameters from the flash, handle calibration, and other special modes.

In the setup(), we configure like this:

api.lorawan.band.set(OTAA_BAND);
api.lorawan.deviceClass.set(RAK_LORA_CLASS_A);
api.lorawan.appeui.set(node_app_eui, 8);
api.lorawan.appkey.set(node_app_key, 16);
api.lorawan.deui.set(node_device_eui, 8);
api.lorawan.njm.set(RAK_LORA_OTAA);
api.lorawan.rety.set(3);
api.lorawan.cfm.set(1);
api.lorawan.adr.set(false);
api.lorawan.txp.set(0);
api.lorawan.registerRecvCallback(recvCallback);
api.lorawan.registerJoinCallback(joinCallback);
api.lorawan.registerSendCallback(sendCallback);

beegee · September 19, 2023, 1:56am

Without the code, I cannot say much.

The only thing that comes to mind: are you using flags to make sure that the two timers do not try to send while a TX is still ongoing?

antoine · September 20, 2023, 1:00am

I’m not. Can you please elaborate?

Meanwhile, the device stopped working again, but this time it’s different. There is a button on my device with an interrupt on an IO pin to wake it up. In the previous cases, the button still woke the device up but the send failed. In this last case, the device won’t wake up.

I connected the STM32CubeProgrammer to the device with a STLink to read the memory and compare with the original hex file, and found a few bytes changed. Seems suspicious.

I’ll reflash the device and test again.

Thanks for your help.

beegee · September 20, 2023, 1:49am

What I mean with the two timers checking if one has triggered is like this:

| Timer one action | Timer two action |
|---|---|
| triggered | |
| start sending | |
| still sending | triggered |
| still sending | start sending |
| problem | problem |

When you call api.lorawan.send a second time before it has finished its current TX/RX cycle, it will throw errors. As one of your timers is running every 24h and the other one every 15 minutes, it can happen once a day.
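A minimal sketch of such a flag-based guard, assuming both timer handlers go through one shared object instead of calling api.lorawan.send directly. `UplinkScheduler`, `Request` and `SendFn` are hypothetical names, not part of RUI3; on the device, `onTimer()` would be called from the two timer handlers, `onTxDone()` from the send callback, and the function passed in would build the payload and call api.lorawan.send.

```cpp
#include <cstdint>

// Hypothetical request flags, one bit per timer.
enum Request : uint8_t { SENSOR = 1 << 0, BATTERY = 1 << 1 };

// Stand-in for a wrapper that builds the payload and calls api.lorawan.send().
using SendFn = bool (*)(uint8_t request);

class UplinkScheduler {
public:
    explicit UplinkScheduler(SendFn send) : send_(send) {}

    // Call from a timer handler: remember the request, send it if the radio is free.
    void onTimer(Request r) { pending_ |= r; pump(); }

    // Call from the send callback: the TX/RX cycle is over, send what was deferred.
    void onTxDone() { busy_ = false; pump(); }

    bool busy() const { return busy_; }
    uint8_t pending() const { return pending_; }

private:
    void pump() {
        if (busy_ || pending_ == 0) return;
        uint8_t next = (pending_ & SENSOR) ? SENSOR : BATTERY;
        if (send_(next)) {  // clear the request only once it has been queued
            pending_ = static_cast<uint8_t>(pending_ & ~next);
            busy_ = true;
        }
    }

    SendFn send_;
    bool busy_ = false;
    uint8_t pending_ = 0;
};
```

With this structure the battery uplink that collides with a sensor uplink is deferred until the callback fires rather than being rejected. A request that the stack refuses stays pending until the next timer or callback event.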

About the flash
If you change any parameters (through the API or AT commands), the change is saved in the user configuration sector in the flash. That can cause differences.

antoine · September 20, 2023, 12:20pm

I inspected past data and yes, there was a conflict between my timers. Once a day, when the battery level is sent, my other data point is missing. This is very likely the problem you describe and I’ll fix my code.

That being said, the devices continue sending normally afterward, so this is likely unrelated to the problem reported at the start of this discussion thread.

I’ll keep on gathering information, hoping to get to the bottom of this.

What are the addresses of the user configuration sector in the flash?

beegee · September 20, 2023, 12:35pm

I am not sure about the address. It is somewhere deep inside the BSP, and the code is not open source.