RAK2245/3 crashing after 68 hours / 63000 packets

OK now I have an idea I think.
First of all, nice location bro, I first saw the dish and I thought, wow those guys are crazy :slight_smile:
Anyway, how do you know it has crashed? Do you stop seeing messages in TTN?
If so, then I believe TTN actually has an issue with that: after a certain number of frames it crashes unless you reset the frame counter. I am not sure that is it, though. You have to tell me what exactly the crash looks like: does the gateway become unresponsive, can you still SSH to the Pi, etc.
Furthermore, I did use balena.io, but that was several months ago, when it was still called resin.io. It is very nice, with a fancy interface and a nice way of building the image, and you can distribute it to many gateways (probably why you chose it). However, I had the same issue: the gateway was crashing after some time. I solved it by installing the packet forwarder from the RAK repo on git. I know it is a d… move, but the support guy from resin.io never answered my query about it. I think I caught them while they were restructuring from resin to balena and it got lost. So sorry, I am not sure how to fix balena itself.
That being said, there is a reasonably large probability that it is one of the two issues I described, neither of which has anything to do with the hardware itself.
Are you set on using balena.io because of its fleet distribution and update system, or are you open to exploring other options, for example flashing the image from the RAK831 repo here:
https://github.com/RAKWireless/RAK831-LoRaGateway-RPi

There is a decent chance this could solve it, assuming it is the balena issue.
Can you try this and see what happens?
I hope what I said makes sense. Please correct me if I am wrong in my assumptions.

Basically, go without balena and perhaps reset the frame counter once a day. If it normally crashes after, let's say, 4 days, see if it makes it to the fifth.
If it does, we are happy because we identified the problem, and also sad because we have no way to solve it :slight_smile:

Keep me posted please. I am curious, as I have considered going back to balena at some point and want to know if the issue is in there.

Hi, @De_Drie_Bronnen

You can refer to @Hobo's experience and try to find the cause. You can also use another RAK software project for the RAK2245 for testing. It is based on Raspbian OS:
https://github.com/RAKWireless/RAK2245-LoRaGateway-RPi-Raspbian-OS

BTW, I think you can use loraserver.io as the NS for testing too. Actually, if you are using https://github.com/RAKWireless/RAK2245-LoRaGateway-RPi-LoRa-Gateway-OS, there is already an NS in your RAK2245 LoRa gateway. :grinning: You can use it freely without connecting to the Internet. Just refer to this tutorial:
http://docs.rakwireless.com/en/LoRa/RAK2245-Pi-HAT/Application-Notes/Get_Start_with_RAK_LoRa_Develop_Kit.pdf

Hi!
Because our old gateway also crashed (in the sense that it wasn’t forwarding packets anymore, but was still accessible over SSH), we will start by changing our NS to loraserver.io and see if it can receive more packets that way without crashing. If that doesn't work, we will try the integrated one. I will keep this thread updated on our progress!
Thank you all

Hi Erza,

Did you do as I recommended, though? Try to reset the packet count, bypass Balena, and flash the packet forwarder to Raspbian Stretch directly.

Also, you mentioned that you have not rebooted it since it is hard to get to. Both balena and Raspbian Stretch allow for a remote reset. Perhaps try this.

Hi! I have not done this yet, I was out of the country until today. The problem is that since we haven’t used balena with the 2243, we cannot connect to it remotely. We wanted to get it up there as fast as possible and didn't configure remote SSH correctly…
The gateway will be returned to us later this week, and we will reflash the packet forwarder and install remote tools.

Hmm, what about the 831? It is using balena, right? Maybe try to reset the counter there.
But the 2243 still crashing even without Balena is not a good sign. I would love it if you kept us posted. This is something I would like to see solved. I have never done more than 30,000 packets, to be honest, so perhaps I should test it myself. I'd have to hook up a bunch of nodes to transmit every 10 secs and see what happens.
It should work with both TTN and a local LoRaServer. I have both options, on the same gateway as a matter of fact, and never had issues with one or the other, but I never had such a load.
Hm… this is so interesting.
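
To get a feel for how many nodes such a test needs, here is a rough back-of-the-envelope sketch in Python. It uses only numbers from this thread (63000 packets in 68 hours, one uplink every 10 seconds per node) and assumes every uplink is received, ignoring collisions and duty-cycle limits. The function names are just for illustration.

```python
import math


def hours_to_reach(target_packets, nodes, interval_s):
    """Hours needed for `nodes` nodes, each sending every `interval_s` seconds,
    to produce `target_packets` uplinks (assumes no packet is lost)."""
    return target_packets * interval_s / (nodes * 3600)


def nodes_needed(target_packets, hours, interval_s):
    """Smallest number of nodes that reaches `target_packets` within `hours`."""
    packets_per_node = hours * 3600 / interval_s
    return math.ceil(target_packets / packets_per_node)


# Average load in the original report: 63000 packets over 68 hours.
reported_rate_per_hour = 63000 / 68

# A single node at one uplink every 10 s sends 360 packets per hour.
single_node_hours = hours_to_reach(63000, nodes=1, interval_s=10)

# Number of nodes needed to hit the same packet count within one day.
one_day_nodes = nodes_needed(63000, hours=24, interval_s=10)

print(f"reported rate: {reported_rate_per_hour:.0f} packets/hour")
print(f"one node at 10 s: {single_node_hours:.0f} hours to reach 63000")
print(f"nodes to reach 63000 in 24 h: {one_day_nodes}")
```

So a single node at 10 secs would take about a week to reach the packet count from the first post, while a handful of nodes gets there in a day.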

So we have the gateway back here. We are going to set up a stress test with some nodes sending packets every few seconds for days, and we'll see when it crashes. Is there a way to extract logs from the 2253 and 831? Will they show us anything? @Fomi @kenyu

Keep us posted on progress. To be honest, we have done transmission every 10 secs for a long time, with lots of frames. However, this was only one node; we never had the case of many nodes. This would be interesting to see. We were expecting that things would go bad as the number of nodes and frames grows, ahaha

Hello everyone
After a lot of testing we have come to a few conclusions:

  • The gateways have no problem processing normal packets (300,000 packets in 96 hours). These were all ABP packets with “hello” in them.
  • The gateway has a problem processing OTAA joins, and therefore crashes if it receives too many.
Log file from the gateway when trying a join:

24.03.19 13:32:26 (+0100) main ##### 2019-03-24 12:32:26 GMT #####
24.03.19 13:32:26 (+0100) main ### [UPSTREAM] ###
24.03.19 13:32:26 (+0100) main # RF packets received by concentrator: 2
24.03.19 13:32:26 (+0100) main # CRC_OK: 100.00%, CRC_FAIL: 0.00%, NO_CRC: 0.00%
24.03.19 13:32:26 (+0100) main # RF packets forwarded: 2 (43 bytes)
24.03.19 13:32:26 (+0100) main # PUSH_DATA datagrams sent: 0 (0 bytes)
24.03.19 13:32:26 (+0100) main # PUSH_DATA acknowledged: 0.00%
24.03.19 13:32:26 (+0100) main ### [DOWNSTREAM] ###
24.03.19 13:32:26 (+0100) main # PULL_DATA sent: 0 (0.00% acknowledged)
24.03.19 13:32:26 (+0100) main # PULL_RESP(onse) datagrams received: 0 (0 bytes)
24.03.19 13:32:26 (+0100) main # RF packets sent to concentrator: 1 (0 bytes)
24.03.19 13:32:26 (+0100) main # TX errors: 0
24.03.19 13:32:26 (+0100) main # TX rejected (collision packet): 0.00% (req:1, rej:0)
24.03.19 13:32:26 (+0100) main # TX rejected (collision beacon): 0.00% (req:1, rej:0)
24.03.19 13:32:26 (+0100) main # TX rejected (too late): 0.00% (req:1, rej:0)
24.03.19 13:32:26 (+0100) main # TX rejected (too early): 0.00% (req:1, rej:0)
24.03.19 13:32:26 (+0100) main ### BEACON IS DISABLED!
24.03.19 13:32:26 (+0100) main ### [JIT] ###
24.03.19 13:32:26 (+0100) main # INFO: JIT queue contains 0 packets.
24.03.19 13:32:26 (+0100) main # INFO: JIT queue contains 0 beacons.

  • When we have a non-RAK gateway online in the same vicinity, that other gateway will process the join, and afterwards the RAK will receive and process the packets from that device.
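
For anyone who wants to compare these statistics blocks over a long stress test, here is a minimal Python sketch that pulls the integer counters out of one block. It assumes the log lines look exactly like the ones pasted above (a timestamp prefix, then " main ", then the packet forwarder output). The names `parse_stats` and `SAMPLE` are made up for illustration, and percentage lines are skipped on purpose.

```python
import re

# Matches counter lines such as "# RF packets received by concentrator: 2".
# The lookahead rejects percentage values like "100.00%".
COUNTER = re.compile(r"#\s+(?P<name>[^:#]+):\s+(?P<value>\d+)(?=\s|$)")


def parse_stats(lines):
    """Return a dict of the integer counters found in one statistics block."""
    stats = {}
    for line in lines:
        # Drop the logging prefix (date, time, UTC offset, "main") if present.
        _, sep, rest = line.partition(" main ")
        body = (rest if sep else line).strip()
        match = COUNTER.match(body)
        if match:
            stats[match.group("name").strip()] = int(match.group("value"))
    return stats


# A few lines copied from the log above.
SAMPLE = [
    "24.03.19 13:32:26 (+0100) main ### [UPSTREAM] ###",
    "24.03.19 13:32:26 (+0100) main # RF packets received by concentrator: 2",
    "24.03.19 13:32:26 (+0100) main # CRC_OK: 100.00%, CRC_FAIL: 0.00%, NO_CRC: 0.00%",
    "24.03.19 13:32:26 (+0100) main # RF packets forwarded: 2 (43 bytes)",
    "24.03.19 13:32:26 (+0100) main # PUSH_DATA datagrams sent: 0 (0 bytes)",
    "24.03.19 13:32:26 (+0100) main # RF packets sent to concentrator: 1 (0 bytes)",
    "24.03.19 13:32:26 (+0100) main # TX errors: 0",
]

for name, value in parse_stats(SAMPLE).items():
    print(f"{name}: {value}")
```

Running this over every block in a saved log gives one row of counters per reporting interval, which makes it easier to spot the moment the numbers stop moving.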

Hi @De_Drie_Bronnen,

How many nodes are we talking about sending join requests at the same time?

Thanks for your thorough testing!
We’ll debug this issue next. :handshake:

Hello

After several more weeks of testing:
The RAKs have issues with processing joins. They receive the join request, but do not send an acknowledgement with the keys. Normal packets keep being processed during that time.
After a while the buffer of unprocessed joins fills up and the gateway stops processing packets altogether…

We would love to use the RAKs, but for the moment we have switched our production devices to MultiTech Conduits.

Hi, @De_Drie_Bronnen

Thank you for your testing too. It helps us to keep improving the user experience of our products.

Hi, is there any roadmap for solving this issue?

Hi, @Oskariot

We’ve been testing for a long time and found an issue that may be related to the one you encountered.
We are fixing it and will release new firmware this week. Then we can test again using the new firmware.

Hi, @Fomi

Thank you for the information and for solving the issue.

Hi @Fomi
Did you release the new firmware yet?
Thank you

Hi, @De_Drie_Bronnen

Yes, we’ve released a new firmware version recently. You can download it from the RAK website:
https://www.rakwireless.com/en/download/LoRa/RAK2245-Pi-HAT#Firmware

These are the release notes:

  1. Fixed the problem where the gateway and server could not reconnect automatically after the network disconnected and reconnected;
  2. Added a fixed IP for the LAN port: 192.168.10.10;
  3. The SSID and password can now be modified in AP mode;
  4. Changed rak-version to gateway-version;
  5. Modified local_conf.json so that only the EUI information is saved and other redundant information is deleted;