RAK3172 randomly losing connection to gateway

I have been experimenting with the LoRaWAN_OTAA example code. Everything works fine, but occasionally, for no obvious reason, api.lorawan.send starts returning false, indicating that the message was not queued for TX, and it keeps returning false from then on. The callback function is not triggered, which means no messages are being sent. The only way to recover is to reboot the device, and I don’t think that is an acceptable solution. The issue appears randomly: sometimes I get 2000+ successful uplinks before it happens, and sometimes it appears shortly after the OTAA join. Has anyone experienced a similar issue, or can staff from RAK give us some ideas?

We have checked the gateway and confirmed that no message is sent from the device. The gateway is fully functional: another node with an SX1276 has been working normally for months. We can confirm the issue is within the RAK3172 itself.

Also, is there any way to enable more detailed debug output? We previously worked with MCCI-LMIC, and the amount of debug messages it provides is very helpful for debugging issues and knowing what is going on.

We are planning to replace the old-school AVR+SX1276 combo with the RAK3172, so we have to solve this issue.

Attached are the small pieces of code around api.lorawan.send that we use:

while (1) {
  // Read sensor, put data into buffer, etc.
  if (api.lorawan.send(txbuffer_size, txbuffer, 8, false, 0)) {
    txstat.uplink_counter++;
  } else {
    // Sometimes, for no apparent reason, api.lorawan.send
    // starts to return false and does not recover
    txstat.fail++;
  }
  api.system.sleep.cpu(10000); // Wait for TX to complete
  printStatistic();            // Show statistics
  api.system.sleep.cpu(30000); // Wait for next interval
}

@chansheunglong

I don’t see that problem. I have had a RAK3172 running in my living room for two months (battery test) and it has never failed.

What is your payload size?
Do you have ADR enabled or disabled?
What LoRaWAN region are you using?

@beegee

I am on AS923, the payload size is 16 bytes, ADR is enabled and the data rate is SF7, using OTAA, connected to TTN through a RAK gateway. I send 5 unconfirmed uplinks, then 1 confirmed uplink, and repeat.

Is there any info on why api.lorawan.send returns false? Is it related to the payload and/or the config?

Thanks for the support.

How often do you send packets?
An unconfirmed LoRaWAN packet send cycle can take 6-7 seconds (including waiting for the end of the RX windows).
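For a rough sense of where those 6-7 seconds go, here is a sketch of the standard LoRa time-on-air formula from the Semtech SX127x datasheet. The function name and its defaults are illustrative only. The 29-byte figure in the note below assumes a 16-byte application payload plus 13 bytes of LoRaWAN framing, with no MAC commands piggybacked in FOpts.

```cpp
#include <algorithm>
#include <cmath>

// Illustrative helper (not an RUI3 function): LoRa time on air in milliseconds.
// codingRate: 1..4 for 4/5..4/8. payloadBytes is the full PHY payload.
double loraTimeOnAirMs(int sf, double bwHz, int payloadBytes,
                       int codingRate = 1, int preambleSymbols = 8,
                       bool explicitHeader = true, bool crcOn = true) {
    const double tSymMs = std::pow(2.0, sf) / bwHz * 1000.0;  // symbol duration
    const int de  = (tSymMs > 16.0) ? 1 : 0;  // low data rate optimisation
    const int ih  = explicitHeader ? 0 : 1;   // implicit-header flag
    const int crc = crcOn ? 1 : 0;

    const double num = 8.0 * payloadBytes - 4.0 * sf + 28.0 + 16.0 * crc - 20.0 * ih;
    const double den = 4.0 * (sf - 2 * de);
    const double payloadSymbols =
        8.0 + std::max(std::ceil(num / den) * (codingRate + 4), 0.0);

    // Preamble is (n + 4.25) symbols, followed by the payload symbols.
    return (preambleSymbols + 4.25 + payloadSymbols) * tSymMs;
}
```

With these assumptions, loraTimeOnAirMs(7, 125000.0, 29) comes out at roughly 66.8 ms, so at SF7 the transmission itself is a small fraction of the cycle and almost all of the time is spent waiting for the RX1 and RX2 windows.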

I am sending every ~40 seconds. After calling api.lorawan.send I wait 10 seconds using api.system.sleep.all for the whole TX process to complete, then another 30 seconds for the next interval. For each call to api.lorawan.send that returns true, I can see that the sendCallback() function is triggered. When api.lorawan.send returns false, sendCallback() is not triggered.

I have the duty cycle limiter disabled.

I am using AS923-3 which is basically the same as AS923.
I will run a test.

Is this correct:

  • ADR enabled
  • Duty Cycle disabled
  • Sending every 40 seconds.

The data rate with SF7 can be DR5 or DR6; which one do you set initially?

I have tried a few different delays after api.lorawan.send:

/*1*/ api.system.sleep.cpu(10000); //Sleep for 10 seconds
/*2*/ api.system.sleep.cpu((api.lorawan.rx1dl.get() + api.lorawan.rx2dl.get()) + 1000); //Sleep for 5+2+1=8 seconds
/*3*/ for (uint8_t i = 0; i < 10; i++) api.system.sleep.cpu(1000); //Sleep for 10*1000ms = 10 seconds

The chance of api.lorawan.send randomly returning false is higher with (2) than with (1).

With (3) it simply does not work: the first call to api.lorawan.send returns true, and every call after that returns false. This is surprising, as we assumed RUI would ensure TX is completed before allowing the sleep to actually happen.

With (2) the error is the same as with (3) but happens much later, roughly 100 uplinks in.

At this point I can see that when api.lorawan.send returns false, service_lora_send() in service_lora.c returns the error -UDRV_BUSY.
My best guess is that the LoRa driver gets stuck at some stage and cannot be reset or released.

My initial configuration is as follows:

  //AS923 Region
  api.lorawan.band.set(RAK_REGION_AS923);

  //Join using OTAA Mode
  api.lorawan.njm.set(RAK_LORA_OTAA);

  //https://docs.rakwireless.com/RUI3/LoRaWAN/#deviceclass
  api.lorawan.deviceClass.set(RAK_LORA_CLASS_A); //Class A

  //https://docs.rakwireless.com/RUI3/LoRaWAN/#rety
  api.lorawan.rety.set(0); //No retry if fail to get confirm

  //https://docs.rakwireless.com/RUI3/LoRaWAN/#cfm
  api.lorawan.cfm.set(false); //No confirmed uplink

  //https://docs.rakwireless.com/RUI3/LoRaWAN/#adr
  api.lorawan.adr.set(true); //Use ADR (adaptive data rate)

  //https://docs.rakwireless.com/RUI3/LoRaWAN/#dcs
  api.lorawan.dcs.set(false); //Ignore duty cycle limit

  //https://docs.rakwireless.com/RUI3/LoRaWAN/#dr
  api.lorawan.dr.set(5); //Use DR5

  //https://docs.rakwireless.com/RUI3/LoRaWAN/#txp
  //https://docs.rakwireless.com/RUI3/Appendix/#tx-power-by-region
  api.lorawan.txp.set(7);

Thanks for the quick support! We can only make the switch to RAK’s products if we can be sure everything works. We have old nodes that have been running continuously for more than 2 years in harsh outdoor conditions without any restart or error, and we do not want to damage our reputation with our next release.

Sorry, I didn’t find time to start any test today.

Instead of using the api.system.sleep.cpu() function, did you consider a completely different firmware structure? Just calling sleep and hoping that the packet is sent is not very reliable (just my opinion).

I have examples that work really well that use the LoRaWAN callbacks and API timers for the data flow.

You can find the code in my GitHub repo RUI3-Sensor-Node-Air-Quality.

I can confirm that -UDRV_BUSY is the reason behind the sudden loss of connection. It seems to happen when the program does not give the driver enough time to finish sending the uplink (in other words, the delay/sleep duration is not long enough?).

IMHO -UDRV_BUSY should be handled inside RUI itself, and the system should recover automatically. What I observe is that once the driver enters the busy state, it stays there.
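Until the stack recovers by itself, the application can at least detect the stuck state explicitly instead of only counting failures. Below is a minimal sketch of such a policy. SendFailureWatchdog is a hypothetical helper, not part of RUI3, and what to do once it trips (rejoin, reboot, raise an alarm) is left to the application.

```cpp
#include <cstdint>

// Hypothetical helper (not an RUI3 API): trips after `limit` consecutive
// send() calls have been rejected, i.e. returned false.
class SendFailureWatchdog {
public:
    explicit SendFailureWatchdog(uint8_t limit) : limit_(limit) {}

    // Feed the return value of every send() call.
    // Returns true once `limit` rejections have happened in a row.
    bool record(bool queued) {
        if (queued) {
            consecutive_ = 0;  // any accepted uplink clears the streak
            return false;
        }
        if (consecutive_ < limit_) consecutive_++;  // saturate, never overflow
        return consecutive_ >= limit_;
    }

    uint8_t consecutiveFailures() const { return consecutive_; }

private:
    uint8_t limit_;
    uint8_t consecutive_ = 0;
};

// Possible use inside the send loop from this thread:
//   static SendFailureWatchdog watchdog(3);
//   bool queued = api.lorawan.send(txbuffer_size, txbuffer, 8, false, 0);
//   if (watchdog.record(queued)) { /* application-defined recovery */ }
```

This does not unstick the driver. It only turns "send keeps returning false" into a single, well-defined event that the firmware can act on.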

I agree with your point: api.lorawan.send only tells us whether the uplink was queued, so we have no way of knowing whether the uplink actually happened, which is not ideal. And without any recovery or reset option, it is not reliable.

One suggestion is to add an advanced API such as api.lorawan.send_blocking that only returns after the TX has completed, allowing RUI to fully manage the uplink period (and also enter low power automatically). Better still, the return value should indicate whether the TX completed or not.
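To make the suggestion concrete, here is a rough sketch of what such a wrapper could look like at application level. Nothing here is an existing RUI3 API: sendBlocking, SendResult and the three callables are all hypothetical, and the wrapper is written as a template so the send, "done" check and sleep functions are supplied by the caller.

```cpp
#include <cstdint>

enum class SendResult { Completed, NotQueued, TimedOut };

// Hypothetical application-level wrapper (not an RUI3 API).
//   send()    : queues the uplink, returns true if it was accepted
//   done()    : returns true once the send callback has fired
//   sleepMs() : waits for the given number of milliseconds
template <typename SendFn, typename DoneFn, typename SleepFn>
SendResult sendBlocking(SendFn send, DoneFn done, SleepFn sleepMs,
                        uint32_t timeoutMs, uint32_t pollMs) {
    if (pollMs == 0) pollMs = 1;  // guard against a loop that never advances
    if (!send()) return SendResult::NotQueued;

    uint32_t waited = 0;
    while (!done()) {
        if (waited >= timeoutMs) return SendResult::TimedOut;
        sleepMs(pollMs);
        waited += pollMs;
    }
    return SendResult::Completed;
}

// Possible use (txDone would be a volatile flag set in sendCallback()):
//   txDone = false;
//   SendResult r = sendBlocking(
//       [] { return api.lorawan.send(txbuffer_size, txbuffer, 8, false, 0); },
//       [] { return txDone; },
//       [](uint32_t ms) { delay(ms); },
//       15000, 100);
```

Which sleep function is safe to pass on RUI3 is an open question: as reported above, a loop of short api.system.sleep.cpu(1000) calls made the problem worse, so the usage comment uses the Arduino delay() purely as a placeholder.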

In our firmware, we determine if the packet is sent by incrementing a counter within the sendCallback function, so that we know if the actual TX happened or not.
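A minimal sketch of that idea, extended so the callback also gates the next send: the application only calls send again once the previous uplink has been reported by the callback, or once a timeout has expired. UplinkTracker is a hypothetical helper, and the wiring in the trailing comment assumes a send callback registered through api.lorawan.registerSendCallback with an int32_t status argument, which is how I understand the RUI3 API.

```cpp
#include <cstdint>

// Hypothetical helper (not part of RUI3): tracks whether an uplink is still
// in flight, using the send callback as the "TX finished" signal.
struct UplinkTracker {
    bool     inFlight   = false;  // true between a queued send and its callback
    uint32_t queuedAtMs = 0;      // time the pending uplink was queued
    uint32_t completed  = 0;      // uplinks reported by the send callback
    uint32_t rejected   = 0;      // send() calls that returned false
    uint32_t timedOut   = 0;      // uplinks whose callback never arrived

    // Call before sending. Returns true if a new send may be attempted.
    bool readyToSend(uint32_t nowMs, uint32_t timeoutMs) {
        // Unsigned subtraction stays correct across millis() wrap-around.
        if (inFlight && (nowMs - queuedAtMs) >= timeoutMs) {
            inFlight = false;
            timedOut++;
        }
        return !inFlight;
    }

    // Call with the return value of send().
    void onSendResult(bool queued, uint32_t nowMs) {
        if (queued) {
            inFlight = true;
            queuedAtMs = nowMs;
        } else {
            rejected++;
        }
    }

    // Call from the send callback.
    void onSendCallback() {
        if (inFlight) {
            inFlight = false;
            completed++;
        }
    }
};

// Assumed wiring:
//   static UplinkTracker tracker;
//   void sendCallback(int32_t status) { tracker.onSendCallback(); }
//   ...
//   if (tracker.readyToSend(millis(), 15000)) {
//       tracker.onSendResult(
//           api.lorawan.send(txbuffer_size, txbuffer, 8, false, 0), millis());
//   }
```

The rejected and timedOut counters separate "send returned false" from "send was accepted but the callback never came", which are two different failure modes when reporting this to RAK.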

I notice that your example uses a timer to send at a fixed interval; however, the implementation still calls api.lorawan.send and hopes for the best.

Take Arduino-LMIC as an example: we can very easily determine the state of the driver. Moreover, even after a TX_FAILED in LMIC, the next uplink can still succeed, whereas api.lorawan.send never recovers once it starts failing.