I have been playing with the LoRaWAN_OTAA example code, and everything works fine, but sometimes, without any obvious reason, api.lorawan.send starts returning false, indicating the message was not queued for TX, and it keeps returning false from then on. The callback function is not triggered, which means no messages are being sent. Any idea why? The issue appears randomly: sometimes I get 2000+ successful uplinks before it happens, and sometimes it appears shortly after the OTAA join. The only way to recover is to reboot the device, but I don’t think that is an acceptable solution. Has anyone experienced a similar issue, or can staff from RAK give us some ideas?
We have checked the gateway and confirmed that no message is sent from the device. The gateway is fully functional: another node with an SX1276 has been working normally for months. We can confirm the issue is within the RAK3172 itself.
Also, is there any way to enable more detailed debug output? We previously worked with MCCI-LMIC, and the amount of debug output it provides is very helpful for debugging issues and knowing what’s going on.
We are planning to replace our old-school AVR+SX1276 combo with the RAK3172, so we have to solve this issue.
Below is a small piece of the code we use around api.lorawan.send:
while (1) {
    // Read sensor, put data to buffer, etc...
    if (api.lorawan.send(txbuffer_size, txbuffer, 8, false, 0)) {
        txstat.uplink_counter++;
    } else {
        // Sometimes, without reason, api.lorawan.send
        // starts to return false and does not recover
        txstat.fail++;
    }
    api.system.sleep.cpu(10000); // Wait for TX complete
    printStatistic();            // Show statistic
    api.system.sleep.cpu(30000); // Wait for next interval
}
I am on AS923, the payload size is 16 bytes, ADR is enabled and the data rate is SF7. I am using OTAA and am connected to TTN through RAK’s gateway. I send five unconfirmed uplinks, then one confirmed uplink, and repeat.
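The five-unconfirmed-then-one-confirmed pattern described above can be expressed as a small helper. This is only a sketch: isConfirmedUplink is a hypothetical name and not part of RUI3, and it assumes the uplink counter starts at 0.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of RUI3): decides whether uplink number
// `index` (0-based) should be confirmed, given a repeating cycle of
// `unconfirmedPerCycle` unconfirmed uplinks followed by one confirmed uplink.
bool isConfirmedUplink(uint32_t index, uint32_t unconfirmedPerCycle) {
    // Guard against overflow when computing the cycle length.
    assert(unconfirmedPerCycle < UINT32_MAX);
    const uint32_t cycleLength = unconfirmedPerCycle + 1;
    // The last slot of every cycle is the confirmed uplink.
    return (index % cycleLength) == unconfirmedPerCycle;
}
```

In the loop shown earlier, the result could replace the hard-coded false passed to api.lorawan.send, assuming that fourth argument is the confirm flag.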
Is there any known reason for api.lorawan.send to return false? Is it related to the payload and/or the configuration?
I am sending every ~40 seconds. After calling api.lorawan.send, I wait for 10 seconds using api.system.sleep.all so that the whole TX process can complete, then another 30 seconds for the next interval. For each call to api.lorawan.send that returns true, I can see that the sendCallback() function is triggered. When api.lorawan.send returns false, sendCallback() is not triggered.
I have tried a few different delays after api.lorawan.send:
/*1*/ api.system.sleep.cpu(10000); //Sleep for 10 seconds
/*2*/ api.system.sleep.cpu((api.lorawan.rx1dl.get() + api.lorawan.rx2dl.get()) + 1000); //Sleep for 5+2+1=8 seconds
/*3*/ for (uint8_t i = 0; i < 10; i++) api.system.sleep.cpu(1000); //Sleep for 10*1000ms = 10 seconds
The chance of api.lorawan.send randomly returning false is higher with (2) than with (1).
(3) simply does not work: the first call to api.lorawan.send returns true, and every call after that returns false. This is surprising, as we thought RUI would ensure TX is completed before allowing the sleep to actually happen.
With (2) the error is the same as with (3) but happens much later, roughly 100 uplinks later.
At this point, I can see that when api.lorawan.send returns false, service_lora_send() in service_lora.c returns the error -UDRV_BUSY.
My best guess is that the LoRa driver is stuck at some stage and cannot be reset or released.
My initial configuration is as follows:
//AS923 Region
api.lorawan.band.set(RAK_REGION_AS923);
//Join using OTAA Mode
api.lorawan.njm.set(RAK_LORA_OTAA);
//https://docs.rakwireless.com/RUI3/LoRaWAN/#deviceclass
api.lorawan.deviceClass.set(RAK_LORA_CLASS_A); //Class A
//https://docs.rakwireless.com/RUI3/LoRaWAN/#rety
api.lorawan.rety.set(0); //No retry if fail to get confirm
//https://docs.rakwireless.com/RUI3/LoRaWAN/#cfm
api.lorawan.cfm.set(false); //No confirmed uplink
//https://docs.rakwireless.com/RUI3/LoRaWAN/#adr
api.lorawan.adr.set(true); //Use ADR (adaptive data rate)
//https://docs.rakwireless.com/RUI3/LoRaWAN/#dcs
api.lorawan.dcs.set(false); //Ignore duty cycle limit
//https://docs.rakwireless.com/RUI3/LoRaWAN/#dr
api.lorawan.dr.set(5); //Use DR5
//https://docs.rakwireless.com/RUI3/LoRaWAN/#txp
//https://docs.rakwireless.com/RUI3/Appendix/#tx-power-by-region
api.lorawan.txp.set(7);
Thanks for the quick support! We can only make the switch to RAK’s product if we can be sure everything works. We have old nodes that have been running continuously for more than 2 years in harsh outdoor conditions without any restart or error, and we do not want to damage our reputation with our next release.
Sorry, I didn’t find time to start any test today.
Instead of using the api.system.sleep.cpu() function, did you consider a completely different firmware structure? Just calling sleep and hoping that the packet is sent is not very reliable (just my opinion).
I have examples that work really well that use the LoRaWAN callbacks and API timers for the data flow.
I can confirm -UDRV_BUSY is the reason behind the sudden loss of connection. It seems to happen if the program does not give the driver enough time to finish sending the uplink (basically meaning the delay/sleep duration is not long enough?).
IMHO the -UDRV_BUSY condition should be handled inside RUI itself and should recover automatically. What I observe is that once the driver enters the busy state, it stays busy.
I agree with your point. Since api.lorawan.send only tells us whether the uplink was queued, we have no way to know from its return value whether the uplink actually happened, which is not ideal. And without any recovery or reset option, it is not reliable.
A suggestion is to add an advanced API such as api.lorawan.send_blocking, which would only return after the TX has completed, allowing RUI to fully manage the uplink period (and also enter low power automatically). Better still, the return value should indicate whether the TX completed or not.
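To make the suggestion concrete, here is a minimal sketch of what a blocking send with a timeout could look like when built on top of a queue call and a TX-done flag. Nothing here is an existing RUI3 API: sendBlocking, SendResult and the three callables are hypothetical stand-ins (for example for api.lorawan.send, a flag set in sendCallback, and a sleep call). Note that repeated short sleeps are exactly what failed in variant (3) earlier in this thread, so on real hardware the wait step may need a different mechanism.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

enum class SendResult { Completed, NotQueued, TimedOut };

// Hypothetical blocking wrapper (not an RUI3 API).
//   queueUplink: returns true if the stack accepted the uplink
//   txDone:      returns true once the TX-finished callback has fired
//   sleepMs:     waits for the given number of milliseconds
SendResult sendBlocking(const std::function<bool()>& queueUplink,
                        const std::function<bool()>& txDone,
                        const std::function<void(uint32_t)>& sleepMs,
                        uint32_t timeoutMs,
                        uint32_t pollMs = 100) {
    assert(pollMs > 0);
    if (!queueUplink()) {
        return SendResult::NotQueued;  // e.g. the driver reported busy
    }
    uint64_t waited = 0;  // 64-bit so the sum cannot wrap around
    while (!txDone()) {
        if (waited >= timeoutMs) {
            return SendResult::TimedOut;  // caller decides how to recover
        }
        sleepMs(pollMs);
        waited += pollMs;
    }
    return SendResult::Completed;
}
```

Returning three distinct results lets the caller tell "never queued" apart from "queued but never finished", which is the distinction the plain boolean return value cannot express.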
In our firmware, we determine if the packet is sent by incrementing a counter within the sendCallback function, so that we know if the actual TX happened or not.
I notice that your example uses a timer to send at a fixed interval; however, the implementation is still calling api.lorawan.send and hoping for the best.
Take Arduino-LMIC as an example: there we can very easily determine the state of the driver. Moreover, even after a TX_FAILED in LMIC, the next uplink can still succeed, whereas api.lorawan.send never recovers once it starts failing.
You can use the LoRaWAN callbacks to know when TX is finished and check the result.
/**
 * @brief Callback after TX is finished
 *
 * @param status TX status
 */
void sendCallback(int32_t status)
{
    Serial.printf("TX status %d\n", status);
}
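One way to combine the callback with the send call is to track whether an uplink is still in flight and refuse to start another one until the callback has fired. The sketch below is plain C++ bookkeeping, not RUI3 code: UplinkTracker is a hypothetical class, queueUplink stands in for a call such as api.lorawan.send, and treating status 0 as success is an assumption that should be checked against the RUI3 documentation.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Hypothetical bookkeeping class (not part of RUI3).
class UplinkTracker {
public:
    // `queueUplink` stands in for a call such as api.lorawan.send(...);
    // it must return true when the stack accepted the uplink.
    bool trySend(const std::function<bool()>& queueUplink) {
        assert(queueUplink);
        if (inFlight_) {        // previous TX not finished yet: do not call send
            ++skippedBusy_;
            return false;
        }
        if (!queueUplink()) {   // stack refused the uplink
            ++rejected_;
            return false;
        }
        inFlight_ = true;
        ++queued_;
        return true;
    }

    // Call from the TX-finished callback.
    // Assumption: status 0 means the transmission succeeded.
    void onTxDone(int32_t status) {
        inFlight_ = false;
        if (status == 0) {
            ++succeeded_;
        } else {
            ++failed_;
        }
    }

    bool inFlight() const { return inFlight_; }
    uint32_t queued() const { return queued_; }
    uint32_t succeeded() const { return succeeded_; }
    uint32_t failed() const { return failed_; }
    uint32_t rejected() const { return rejected_; }
    uint32_t skippedBusy() const { return skippedBusy_; }

private:
    bool inFlight_ = false;
    uint32_t queued_ = 0;
    uint32_t succeeded_ = 0;
    uint32_t failed_ = 0;
    uint32_t rejected_ = 0;
    uint32_t skippedBusy_ = 0;
};
```

Comparing queued() with succeeded() plus failed() over time also shows whether an accepted uplink ever failed to produce a callback.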
I have also observed this behavior a few times. Some of my devices just stop sending data. I was able to connect to the serial port to gather debug info from my device, and I can see that the call to api.lorawan.send returns false. Calls to api.lorawan.join also return false. If I reboot the device by cycling the power, everything is back to normal.
As a workaround, I am considering calling api.system.reboot() if send() returns false. Do you think this could work?
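A reboot on the very first false return may be too aggressive, since a single rejected send can also happen for harmless reasons. A possible refinement, sketched below, is to reboot only after several consecutive rejections. SendWatchdog is a hypothetical helper, not part of RUI3, and the threshold is an arbitrary value to tune.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of RUI3): counts consecutive rejected sends
// and reports when a reboot looks warranted.
class SendWatchdog {
public:
    explicit SendWatchdog(uint32_t maxConsecutiveFailures)
        : limit_(maxConsecutiveFailures) {
        assert(limit_ > 0);
    }

    // Pass in the return value of the send call.
    // Returns true once `limit_` sends in a row have been rejected.
    bool record(bool sendAccepted) {
        if (sendAccepted) {
            failures_ = 0;  // any accepted send resets the streak
            return false;
        }
        if (failures_ < limit_) {
            ++failures_;    // saturate at the limit instead of overflowing
        }
        return failures_ >= limit_;
    }

    uint32_t consecutiveFailures() const { return failures_; }

private:
    uint32_t limit_;
    uint32_t failures_ = 0;
};
```

In firmware, the return value of api.lorawan.send would be passed to record(), and api.system.reboot() would be called only when record() returns true.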
The device that I power cycled 2 days ago stopped transmitting again. This is the same problem, calling api.lorawan.send returns false.
The device transmits an unconfirmed packet every 15 minutes, payload size 8 bytes, with SF10BW125. I have multiple identical devices that have been working for months and are still working.
I logged the LoRaWAN packets on the gateway and they are all identical.
This is likely a bug within the firmware. My code is based on firmware version RUI_4.0.5_RAK3172-E.
I can’t share the code as it is because it’s proprietary, but the overall architecture is relatively simple.
In the loop, I call api.system.sleep.all(); and do some checks for digital inputs (handling of a button that is not used once deployed).
There are two timers: one to check our sensor and send an unconfirmed message every 15 minutes, and another one to send the battery level every 24 hours.
Otherwise we have a recvCallback to handle downlink payload and code to save/load parameters from the flash, handle calibration, and other special modes.
Meanwhile, the device stopped working again, but this time it’s different. There is a button on my device with an interrupt on an IO pin to wake it up. In the previous cases, the button still woke the device up but the send failed. In this last case, the device won’t wake up at all.
I connected the STM32CubeProgrammer to the device with a STLink to read the memory and compare with the original hex file, and found a few bytes changed. Seems suspicious.
What I mean with the two timers checking if one has triggered is like this:
| Timer one action | Timer two action |
|------------------|------------------|
| triggered        |                  |
| start sending    |                  |
| still sending    | triggered        |
| still sending    | start sending    |
| problem          | problem          |
When you call api.lorawan.send a second time before it has finished its current TX/RX cycle, it will throw errors. As one of your timers runs every 24 hours and the other one every 15 minutes, this can happen once a day.
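One way to avoid this kind of collision is to let both timers only queue their payload and to start the next transmission from the TX-finished callback. The sketch below illustrates the idea in plain C++. UplinkQueue is a hypothetical class, the injected send function stands in for a wrapper around api.lorawan.send, and it assumes the TX-finished callback is only invoked once the stack can accept another uplink, which this thread does not confirm.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

using Payload = std::vector<uint8_t>;

// Hypothetical serializer (not part of RUI3): at most one uplink in flight.
class UplinkQueue {
public:
    // `send` must return true when the stack accepted the payload.
    explicit UplinkQueue(std::function<bool(const Payload&)> send)
        : send_(std::move(send)) {
        assert(send_);
    }

    // Called from any timer handler instead of sending directly.
    void submit(Payload payload) {
        pending_.push_back(std::move(payload));
        pump();
    }

    // Called from the TX-finished callback.
    void onTxDone() {
        inFlight_ = false;
        pump();
    }

    std::size_t pendingCount() const { return pending_.size(); }

private:
    // Starts the next transmission if the radio is idle.
    void pump() {
        if (inFlight_ || pending_.empty()) {
            return;
        }
        if (send_(pending_.front())) {
            pending_.pop_front();
            inFlight_ = true;
        }
        // If send_ refused, the payload stays queued and is retried on the
        // next submit() or onTxDone().
    }

    std::function<bool(const Payload&)> send_;
    std::deque<Payload> pending_;
    bool inFlight_ = false;
};
```

On a memory-constrained device the queue should probably be bounded, for example by dropping the oldest entry once a small limit is reached.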
About the flash
If you change any parameters (through the API or AT commands), the change is saved in the user configuration sector in the flash. That can cause differences.
I inspected past data and yes, there was a conflict between my timers. Once a day, when the battery level is sent, my other data point is missing. This is very likely the problem you describe and I’ll fix my code.
That being said, the devices continue sending normally afterward, so this is likely unrelated to the problem reported at the start of this discussion thread.
I’ll keep on gathering information, hoping to get to the bottom of this.
What are the addresses of the user configuration sector in the flash?