RAK2245/3 crashing after 68 hours / 63000 packets

Hello, we spoke with you at The Things Conference.
We have several RAK831 gateways, some of which are deployed at very high-traffic locations (sometimes more than one packet received per second). They crash every few days; we think this is because of the heat they generate.

At The Things Conference we got a new 2243 gateway with heatsinks, so we were hoping this would solve the problem. However, after less than three days and 63000 packets the gateway crashed.
Is there any way for us to solve this? We are rolling out LoRaWAN in Belgium and would love to use your technology for this.

Thank you in advance
Ezra van Manen
De Drie Bronnen

Hi, @De_Drie_Bronnen

Is the RAK831/2243 installed inside another device or an enclosure? If so, could you take a photo of it and upload it here?
Thank you!

Hi! @Fomi
Thank you for your reply.
No, they are used in an open room, in a well-ventilated environment with a 120 mm fan blowing on them. The concentrator does get quite warm, but the room around it is at a normal temperature (10-18°C).
We are using this antenna https://www.conrad.be/p/antenne-aurel-gp-868-650200599-190123?searchSuggest=product&searchTerm=190123&searchType=suggest

The antenna is on top of a roof 40 meters high. The RAK with the Raspberry Pi is in a room under the roof, so not outside.

Thank you in advance
Ezra van Manen
De Drie Bronnen

@Fomi
Also, the RAK831/2243 gateways that we have outside the city (so with less traffic) are fine; they are not crashing. But the ones in the city are.

Hi, you are correct: the RAK2245/3 improves the heat dissipation design, as mentioned here: https://twitter.com/RAKwireless/status/1093369208430362624. We also found that a high SX1301 temperature has quite an impact on system stability. Do you have an enclosure for your Pi system with the RAK2245 Pi HAT? How do you send packets in your system, and can we reproduce it locally here? We would love to help you complete the project. Which gateway software version are you using?

Hmm, this is just a suspicion, but perhaps the issue is not overheating at all but rather a software limitation. Can you describe what exactly you mean by crashing every few days? Is 63000 the exact number of packets at which the crash happens, and does this occur every time (does every crash come after 63000 packets)?
Could you elaborate on the back end? Are you using TTN, LoRaServer, etc.?
Thank you in advance for the information; this seems like an interesting issue to solve.

We are planning to do something similar, so it would be a good experience for us if you keep us posted on the resolution of the problem.
Ah, also respect for doing the system in Belgium, go LoRaWAN :slight_smile:

Hi, thank you all for your help and replies!


This shows the antenna and its position. It is the small antenna to the right, under the dish antenna.

This shows the RAK831, but our new RAK2243 is in the same spot now. We replaced the 831 with it. We hoped it would solve the crashes, but it didn't.

We are using The Things Network.
This is how much the 2243 received before going down:

Received Messages 64234

Transmitted Messages 3

We have not rebooted it yet because it is really hard to get to… that's why we installed the new one, hoping it would not crash.

We are using this GitHub repository to set up the gateway: https://github.com/RAKWireless/RAK2245-LoRaGateway-RPi-LoRa-Gateway-OS

@kenyu we have a great position: the gateway and antenna are placed on a building in the middle of the city, about 40 meters high. It covers a large area and receives a lot. What other information would you need from me to help you diagnose / reproduce it?

Also important to note:
We used balena.io for our RAK831 gateways, and we could see that when a gateway crashed the Raspberry Pi itself was still online, but on The Things Network it was not.
The RAK2245 does not use balena, and we have no way to SSH into the device remotely as of now.

Thank you all!
Ezra

image
We used this image on our 2245/3.

OK, now I have an idea, I think.
First of all, nice location bro. I first saw the dish and I thought, wow, those guys are crazy :slight_smile:
Anyway, how do you know it has crashed? You stop seeing messages in TTN?
If so, then I believe TTN actually has an issue with that: after a certain number of frames it crashes unless you reset the frame counter. I am not sure it is that; you have to tell me what exactly the crashing process is. Does the gateway become unresponsive, can you SSH to the Pi, etc.?
Furthermore, I did use balena.io, but it was several months ago, when it was still called resin.io. It is very nice, with a fancy interface and a nice way of making the image, and you can distribute it to many gateways (probably why you chose it). However, I had the same issue: the gateway was crashing after some time. I solved it by installing the packet forwarder from the RAK repo on git. I know it is a d… move, but the support guy from resin.io never answered my query about it. I think I caught the time when they were restructuring from resin to balena and it got lost. So sorry, I am not sure how to fix balena itself.
That being said, there is a reasonably large probability that it is one of the two issues I described, neither of which has anything to do with the hardware itself.
Are you set on using balena.io because of the fleet distribution and updating system, or are you open to exploring other ways, for example flashing the image from the RAK831 repo here:
https://github.com/RAKWireless/RAK831-LoRaGateway-RPi

There is a decent chance this could solve it, if it is the balena issue.
Can you try this and see what happens?
I hope what I said makes sense; please correct me if I am wrong in my assumptions.

Basically, go without balena and perhaps reset the frame counter once a day; if it normally crashes after, let's say, 4 days, see if it makes it to the fifth.
If it does, we are happy because we identified the problem, and also sad because we have no way to solve it :slight_smile:

Keep me posted please; I am curious, as I have considered going back to balena at some point and want to know if the issue is in there.

Hi, @De_Drie_Bronnen

You can refer to @Hobo's experience and try to find the cause. You can also test with RAK's other software project for the RAK2245, which is based on Raspbian OS:
https://github.com/RAKWireless/RAK2245-LoRaGateway-RPi-Raspbian-OS

BTW, I think you can use loraserver.io as the NS for testing too. Actually, if you are using https://github.com/RAKWireless/RAK2245-LoRaGateway-RPi-LoRa-Gateway-OS, there is already an NS in your RAK2245 LoRa gateway. :grinning: You can use it freely without connecting to the Internet. Just refer to this tutorial:
http://docs.rakwireless.com/en/LoRa/RAK2245-Pi-HAT/Application-Notes/Get_Start_with_RAK_LoRa_Develop_Kit.pdf

Hi!
Because our old gateway also crashed (in the sense that it wasn’t forwarding packets anymore but was still accessible over SSH), we will start by changing our NS to loraserver.io and see whether it can receive more packets that way without crashing. If that doesn’t work, we will try the integrated one. I will keep this thread updated on our progress!
Thank you all

Hi Ezra,

Did you do as I recommended, though? Try to reset the packet count, bypass balena, and flash the packet forwarder to Raspbian Stretch directly.

Also, you mentioned that you have not rebooted it since it is hard to get to. Both balena and Raspbian Stretch allow for a remote reset. Perhaps try this.

Hi! I have not done this yet; I was out of the country until today. The problem is that since we haven’t used balena with the 2243, we cannot connect to it remotely. We wanted to get it up there as fast as possible and didn’t configure remote SSH correctly…
The gateway will be returned to us later this week, and we will reflash the packet forwarder and install remote tools.

Hmm, what about the 831? It is using balena, right? Maybe try to reset the counter there.
But the 2243 still crashing even without balena is not a good sign. I would love it if you kept us posted. This is something I would like to see solved. I have never done more than 30,000 packets, to be honest, so perhaps I should test it myself. I'll have to hook up a bunch of nodes to transmit every 10 seconds and see what happens.
It should work with both TTN and LoRaServer locally. I have both options, on the same gateway as a matter of fact, and never had issues with one or the other, but I never had such a load.
Hm… this is so interesting.

So we have the gateway back here. We are going to set up a stress test with some nodes sending packets every few seconds for days, and we'll see when it crashes. Is there a way to extract logs from the 2243 and 831? Will they show us anything? @Fomi @kenyu
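For sizing a stress test like this, a rough back-of-the-envelope estimate may help. The sketch below uses only figures from this thread (64234 received messages, roughly 68 hours of uptime, nodes sending every 10 seconds); the function names are made up for illustration, and it assumes every transmitted packet is actually received by the gateway.

```python
import math

def average_rate(packets: int, hours: float) -> float:
    """Average number of packets per second over the given period."""
    return packets / (hours * 3600)

def nodes_needed(target_rate: float, interval_s: float) -> int:
    """Nodes required to reach target_rate (packets/s) when each node
    sends one packet every interval_s seconds."""
    return math.ceil(target_rate * interval_s)

def hours_to_reach(packets: int, nodes: int, interval_s: float) -> float:
    """Hours needed for `nodes` nodes to generate `packets` packets in total."""
    return packets * interval_s / nodes / 3600

# Figures reported in this thread: 64234 messages in about 68 hours.
field_rate = average_rate(64234, 68)
print(f"average field rate: {field_rate:.2f} packets/s")
print("nodes (10 s interval) to match the average:", nodes_needed(field_rate, 10))
print("nodes (10 s interval) to reach 1 packet/s:", nodes_needed(1.0, 10))
print(f"hours for 10 nodes to send 64234 packets: {hours_to_reach(64234, 10, 10):.1f}")
```

The average rate hides the bursts of more than one packet per second mentioned earlier, so a test that only matches the average may be gentler than the real site.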

Keep us posted on progress. To be honest, we have done transmissions every 10 seconds for a long time, with lots of frames. However, this was only one node; we never had the case of many nodes. This would be interesting to see. We were expecting that things would go bad as the number of nodes and frames grows, ahaha

Hello everyone,
After a lot of testing we have come to a few conclusions:

  • The gateways have no problem processing normal packets (300,000 packets in 96 hours). These were all ABP packets with “hello” in them.
  • The gateway has a problem processing OTAA joins, and therefore crashes if it receives too many.
Log file from the gateway when attempting a join:

24.03.19 13:32:26 (+0100) main ##### 2019-03-24 12:32:26 GMT #####
24.03.19 13:32:26 (+0100) main ### [UPSTREAM] ###
24.03.19 13:32:26 (+0100) main # RF packets received by concentrator: 2
24.03.19 13:32:26 (+0100) main # CRC_OK: 100.00%, CRC_FAIL: 0.00%, NO_CRC: 0.00%
24.03.19 13:32:26 (+0100) main # RF packets forwarded: 2 (43 bytes)
24.03.19 13:32:26 (+0100) main # PUSH_DATA datagrams sent: 0 (0 bytes)
24.03.19 13:32:26 (+0100) main # PUSH_DATA acknowledged: 0.00%
24.03.19 13:32:26 (+0100) main ### [DOWNSTREAM] ###
24.03.19 13:32:26 (+0100) main # PULL_DATA sent: 0 (0.00% acknowledged)
24.03.19 13:32:26 (+0100) main # PULL_RESP(onse) datagrams received: 0 (0 bytes)
24.03.19 13:32:26 (+0100) main # RF packets sent to concentrator: 1 (0 bytes)
24.03.19 13:32:26 (+0100) main # TX errors: 0
24.03.19 13:32:26 (+0100) main # TX rejected (collision packet): 0.00% (req:1, rej:0)
24.03.19 13:32:26 (+0100) main # TX rejected (collision beacon): 0.00% (req:1, rej:0)
24.03.19 13:32:26 (+0100) main # TX rejected (too late): 0.00% (req:1, rej:0)
24.03.19 13:32:26 (+0100) main # TX rejected (too early): 0.00% (req:1, rej:0)
24.03.19 13:32:26 (+0100) main ### BEACON IS DISABLED!
24.03.19 13:32:26 (+0100) main ### [JIT] ###
24.03.19 13:32:26 (+0100) main # INFO: JIT queue contains 0 packets.
24.03.19 13:32:26 (+0100) main # INFO: JIT queue contains 0 beacons.

  • When we have a non-RAK gateway online in the same vicinity, that gateway processes the join, and afterwards the RAK receives and processes the packets from that device.
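For anyone who wants to watch for this in longer log files, here is a minimal Python sketch that pulls a few counters out of statistics reports in the format shown above. The function names are made up for illustration, and the "stalled" rule (packets forwarded while no PUSH_DATA datagrams were sent) is only an assumption based on the single report in this post, not a confirmed diagnosis.

```python
import re

# Header of one statistics report, e.g. "##### 2019-03-24 12:32:26 GMT #####".
REPORT_RE = re.compile(r"#####\s*(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) GMT\s*#####")
# Counters of interest, e.g. "# RF packets forwarded: 2 (43 bytes)".
STAT_RE = re.compile(
    r"#\s*(RF packets received by concentrator|RF packets forwarded|"
    r"PUSH_DATA datagrams sent):\s*(\d+)"
)

def parse_reports(lines):
    """Return one dict per statistics report found in the log lines."""
    reports = []
    current = None
    for line in lines:
        header = REPORT_RE.search(line)
        if header:
            current = {"time": header.group(1)}
            reports.append(current)
            continue
        stat = STAT_RE.search(line)
        if stat and current is not None:
            current[stat.group(1)] = int(stat.group(2))
    return reports

def looks_stalled(report):
    """Assumed heuristic: packets were forwarded but nothing was sent upstream."""
    return (report.get("RF packets forwarded", 0) > 0
            and report.get("PUSH_DATA datagrams sent", 0) == 0)

# A few lines copied from the log above.
SAMPLE = """\
24.03.19 13:32:26 (+0100) main ##### 2019-03-24 12:32:26 GMT #####
24.03.19 13:32:26 (+0100) main # RF packets received by concentrator: 2
24.03.19 13:32:26 (+0100) main # RF packets forwarded: 2 (43 bytes)
24.03.19 13:32:26 (+0100) main # PUSH_DATA datagrams sent: 0 (0 bytes)
"""

for report in parse_reports(SAMPLE.splitlines()):
    print(report["time"], "stalled" if looks_stalled(report) else "ok")
```

Running it over a full log from the stress test would show whether reports like this one line up with the joins.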