ROBIN - Open Source Mesh Network users community forum

r2680: busybox traceroute sometimes hangs (OM1P)

 
duncanSF
User
Joined: 13 Mar 2009
Posts: 49
Location: California

Posted: Mon Dec 28, 2009 10:26 am    Post subject: r2680: busybox traceroute sometimes hangs (OM1P)

I have three nodes now running r2680.

One of them is having trouble checking in. I've traced the problem to update, which calls speed-test, which sources /lib/robin/last-hop.sh, which executes:
Code:
traceroute -m1 -n 1.2.3.4


That is: without looking up hostnames, trace the route toward some bogus host on the Internet, looking outward at most one hop.

On the nodes where this works it returns within 15 seconds: it sends three probes and waits five seconds for each reply. Usually this takes a few milliseconds. When one probe is lost it takes about five seconds.

On the node where this often doesn't work, traceroute doesn't return until killed. This stalls the update script at the test trying to identify the gateway. The node does not check in. The process table fills with traceroute, awk and grep and after a few hours the node reboots with the excuse 'low memory'.

A binary comparison between busybox on the working and non-working nodes showed they are identical. The environment reported by 'set' at the busybox ash prompt showed no difference either. There is no busybox.conf on either node.

Code:
root@A:~# free
              total         used         free       shared      buffers
  Mem:        30060        19464        10596            0         1688
 Swap:            0            0            0
Total:        30060        19464        10596
root@A:~# time traceroute -m1 -n 1.2.3.4
traceroute to 1.2.3.4 (1.2.3.4), 1 hops max, 38 byte packets
 1  5.204.89.186  9.364 ms  6.163 ms  7.459 ms
real    0m 0.06s
user    0m 0.00s
sys     0m 0.02s
root@A:~# time traceroute -m1 -n 1.2.3.4
traceroute to 1.2.3.4 (1.2.3.4), 1 hops max, 38 byte packets
 1  5.204.89.186  5.643 ms  5.274 ms  5.076 ms
real    0m 0.06s
user    0m 0.00s
sys     0m 0.05s
root@A:~# time traceroute -m1 -n 1.2.3.4
traceroute to 1.2.3.4 (1.2.3.4), 1 hops max, 38 byte packets
 1  5.204.89.186  5.048 ms  5.132 ms  10.863 ms
real    0m 0.05s
user    0m 0.00s
sys     0m 0.04s
root@A:~# time traceroute -m1 -n 1.2.3.4
traceroute to 1.2.3.4 (1.2.3.4), 1 hops max, 38 byte packets
 1  5.204.89.186  7.927 ms  7.350 ms^CCommand terminated by signal 2
real    9m 53.56s
user    0m 0.03s
sys     0m 0.04s

After nearly ten minutes the traceroute invocation had not returned.
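
One way to keep a hung traceroute from stalling the caller is to run it under a watchdog. The following is a minimal sketch, not part of ROBIN: run_with_timeout is a hypothetical helper name, and it assumes only a POSIX shell with sleep and kill, since a busybox build may not include a timeout applet.

```shell
# run_with_timeout SECONDS COMMAND [ARGS...]   (hypothetical helper)
# Runs COMMAND in the background and kills it if it is still alive after
# SECONDS. Returns the command's exit status (non-zero if it was killed).
run_with_timeout() {
    limit=$1
    shift
    "$@" &
    cmd_pid=$!
    # Watchdog: sleep, then kill the command if it still exists.
    # Its output is discarded so it never holds the caller's pipe open.
    ( sleep "$limit"; kill "$cmd_pid" 2>/dev/null ) >/dev/null 2>&1 &
    dog_pid=$!
    wait "$cmd_pid"
    status=$?
    # The command has finished or was killed; the watchdog is not needed.
    kill "$dog_pid" 2>/dev/null
    wait "$dog_pid" 2>/dev/null
    return "$status"
}
```

A caller could then use run_with_timeout 20 traceroute -m1 -n 1.2.3.4 and treat a non-zero status as "no answer". The 20-second limit is an assumption, chosen to sit just above the 15 seconds that three lost probes would take.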


This seems a rare case.

I don't know how the node got that way but it has kept the problem over several reboots, both with and without wired clients connected to its network port. The problem has not followed those clients, which I moved to another node. I could probably flash it and bury the issue . . . but it got that way somehow and I suspect others may see the same thing.

I suspect it's a bug in busybox, though I don't know what tickles it or why it's affecting just one node.

Suggestions?
duncanSF
User
Joined: 13 Mar 2009
Posts: 49
Location: California

Posted: Tue Dec 29, 2009 12:53 am

My node 'B' is now behaving as 'A' did yesterday: traceroute waits forever instead of for just a few seconds when a probe is lost. 'A' retains the fault, even after a hard power cycle.

The more I ponder that series of scripts the more I dislike it, especially last-hop.sh.

It depends on busybox traceroute, which sometimes just hangs.

It depends on traceroute returning an IP in the second column. A valid response in that position is an asterisk.
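
A sketch of a more defensive parse, for illustration: it accepts a hop address only if a field on the hop-1 line really looks like a dotted quad, and reports failure when every probe shows an asterisk. parse_first_hop is a hypothetical name, and the sample line format is taken from the traceroute output quoted earlier in this thread.

```shell
# parse_first_hop   (hypothetical helper)
# Reads traceroute output on stdin. Prints the first IPv4 address found on
# the line for hop 1, or returns 1 if that line holds only "*" replies.
parse_first_hop() {
    awk '
        $1 == "1" {
            # Scan past the hop number: a lost probe leaves "*" in its place.
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/) {
                    print $i
                    found = 1
                    break
                }
            }
            exit
        }
        END { exit (found ? 0 : 1) }
    '
}
```

Usage would be along the lines of traceroute -m1 -n 1.2.3.4 | parse_first_hop, with the caller checking the exit status before trusting the output.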

It presumes participating routers will have addresses in 5.0.0.0/8. The local web configuration of version r2671 allows for choice of something other than 5 in that position. Meraki chose 6. The folks at IANA would chant 10.

Even if the constant 5 were replaced by a function of the node's own IP there is no guarantee the gateway agrees. It doesn't even have to be on the same administrative network if strict meshing is unset. Without such a filter it becomes impossible to determine the last participating router by looking at traceroute's output.
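
For illustration only, deriving the filter from the node's own address rather than from the constant 5 might look like the sketch below. same_first_octet is a hypothetical helper, and as noted above it still cannot tell whether the gateway belongs to the same mesh.

```shell
# same_first_octet MY_IP OTHER_IP   (hypothetical helper)
# Succeeds when both dotted-quad addresses share their first octet.
same_first_octet() {
    [ "${1%%.*}" = "${2%%.*}" ]
}
```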

It makes two route-related queries. One of them requires the standard kernel routing table to contain a sensible default route, which, if I recall, is not the case for layer-3 BATMAN. Another looks like an explicit call to olsrd with netcat. Both of these tend toward locking ROBIN to OLSR which feels like a step in the wrong direction or at least a step away from modular routing.

/lib/robin/last-hop.sh contains a sourceable shell script function getLastHop(). There is no comment documentation giving the arguments the function expects, the format of any value it could return, a list of shell variables the function might read, set or clobber or a list of files used or modified by the function.

It deletes /tmp/topology_table at one point and replaces it several seconds or minutes later, if it gets that far. Does anything else expect /tmp/topology_table to exist?
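
If other scripts do read that file, one common way to avoid the gap is to build the new copy under a temporary name and rename it into place. This is a sketch under that assumption, not what last-hop.sh does today; write_atomic is a hypothetical helper name.

```shell
# write_atomic TARGET   (hypothetical helper)
# Copies stdin to TARGET so that readers see either the old file or the
# complete new one, never a missing or half-written file.
write_atomic() {
    target=$1
    # Same directory as the target, so mv is a rename on one filesystem.
    tmp="$target.$$.tmp"
    cat > "$tmp" || { rm -f "$tmp"; return 1; }
    mv -f "$tmp" "$target"
}
```

The pipeline that produces the table would then end in | write_atomic /tmp/topology_table instead of a plain redirection.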

At the end of the day getLastHop()'s purpose is to give speed-test a target from which to pull a megabyte of scratch data with wget, as a local speed test. In an environment where a 1 Mbps signal path can be chosen and may even be best, the speed test busies the single channel used by all nearby nodes for more than ten seconds, which is disruptive to the end user who may be playing an interactive game or trying to hold a telephone conversation over the link. Repeat that every five minutes and for every node on the wrong side of a weak link and the instrumentation sinks the ship.
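
The arithmetic behind that estimate can be sketched as follows. The figures are assumptions for illustration: a payload of 1,048,576 bytes, a raw rate of exactly 1,000,000 bits per second, no protocol overhead or retransmissions, and a hypothetical group of five nodes behind the weak link. The function names are made up for this example.

```shell
# transfer_seconds BYTES BITS_PER_SECOND
# Whole seconds of airtime, rounded up, ignoring overhead and retries.
transfer_seconds() {
    echo $(( ($1 * 8 + $2 - 1) / $2 ))
}

# airtime_percent SECONDS_PER_TEST INTERVAL_SECONDS NODES
# Share of the channel, in percent, spent on speed tests alone.
airtime_percent() {
    echo $(( $1 * $3 * 100 / $2 ))
}
```

transfer_seconds 1048576 1000000 prints 9, a lower bound consistent with the "more than ten seconds" observed once overhead is included. airtime_percent 9 300 5 prints 15, meaning five such nodes testing every five minutes would spend about 15% of the shared channel on instrumentation.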

The last observation leads me to hold off on optimization of details within the loops. My need may be served better by simply shutting off some of the instrumentation.


image: sporadic checkins for nodes exhibiting this problem.
network name: waves2
ispyisail
Site Admin
Joined: 12 Sep 2008
Posts: 4604
Location: New Zealand

Posted: Tue Dec 29, 2009 4:02 am

I would suggest you sign up to the mailing list and submit your findings or a summary with a link to the post.



Antonio, phred, etc. appear to be moving more in that direction for more technical posts like this one.

Basic checks

-Wifi/wireless interference?
-Nodes too far apart?

foxtroop11
Service Provider
Joined: 22 Mar 2009
Posts: 1168
Location: Ansbach, Germany and sometimes the States

Posted: Tue Dec 29, 2009 8:14 am

Thanks for pointing this out. I'm wondering if I was seeing the same problem at a motel I set up. One node would show the same type of checkin data you show. I thought the router was just too far away or something; now I'm thinking maybe it's fine and just not checking in.

Quote:
Both of these tend toward locking ROBIN to OLSR which feels like a step in the wrong direction or at least a step away from modular routing.


As far as this goes, new Robin firmware doesn't even include Batman, nor is there an option on the Open-Mesh dash to change the routing choice. While having options is great (e.g. LWP 5.xx changes Nodog/Coova), it just appears to me that it makes things harder to maintain and really just gets people bogged down in trying to maintain 20 different ways to do something instead of really fine-tuning what we actually need/desire in terms of the overall usage.
duncanSF
User
Joined: 13 Mar 2009
Posts: 49
Location: California

Posted: Thu Dec 31, 2009 12:52 am

Then the fix to the problem spawning this thread is to avoid traceroute.

We could
  • alter olsrd.conf to accept info queries from hosts other than just localhost
  • replace traceroute in last-hop.sh with direct queries to remote instances of OLSR until we reach a router that won't answer our query.

Benefits:
  • we avoid busybox's buggy traceroute.


Consequences:
  • the answer may be wrong. It may be some router between us and the actual gateway.
  • this method depends on nc which, like traceroute, is implemented in busybox.

I suppose we could do the same thing with wget after first removing the password requirement from the needed page [http://node:8080/cgi-bin/mesh.cgi].

Benefits:
  • this will be able to learn routes through r2680 and r2671.

Consequences:
  • possibly higher memory usage on the probing node during successful runs.
  • higher memory and cpu usage on upstream nodes.

We should probably make mesh.cgi accessible without access restrictions. Really.

That rather heavyhanded solution still doesn't take busybox's network code out of the picture entirely: the cgi script launched by the upstream webserver in response to the request made by wget uses busybox's nc. nc is not confirmed to share traceroute's problem. Failure is also pretty unlikely given that nc's target is localhost (by number). httpd, it turns out, is also busybox.

I nominate myself for rewriting the affected files.

ispyisail: I certainly have wireless interference here. I have several neighbors using my gear (open-mesh and meraki) to browse youtube, one or two using it with Netflix, several nearby neighbors using the same channels, several amplified business neighbors defeating collision avoidance by running on the "off" channels (not 1, 6 and 11 but instead 2, 5 and 7). I'm also stress testing the routing by deliberately introducing hidden node interference and asymmetrical signal qualities with my placement choices.

That said, a few dropped frames should not stall any script, especially in a way that doesn't get cleaned up by another process.
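
One way to stop stalled runs from piling up is to let only one instance past a lock. This is a sketch assuming a POSIX shell; with_lock and the exit code 75 are choices made for this example, not anything ROBIN's scripts currently do.

```shell
# with_lock LOCKDIR COMMAND [ARGS...]   (hypothetical helper)
# mkdir either creates the directory or fails, atomically, so only one
# caller at a time gets past it. Returns 75 if the lock is already held.
with_lock() {
    lockdir=$1
    shift
    mkdir "$lockdir" 2>/dev/null || return 75
    "$@"
    status=$?
    rmdir "$lockdir"
    return "$status"
}
```

A scheduled job could call with_lock /tmp/update.lock followed by the update command, so a run that finds the lock held exits at once instead of queueing behind a hung one. The lock path is an assumption, and a run that is killed outright leaves a stale lock behind that something else would have to clear.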

foxtroop11: Yes, the nodes failing to check in were, for the most part, useable during the reported outages. They would reboot every few hours which is disruptive to associated users but they weren't as dead as they looked.

The release of r2671, which dropped batmand, came just a few weeks after I'd compared the performance of OLSR and BATMAN on r1523 and found BATMAN to perform more reliably in the presence of a weak alternate path. I'm testing improvements to olsr's configuration, anticipating the major change batman-adv would bring, and hanging onto r1523 as a personal fallback.

I do hope NoDogSplash isn't dropped anytime soon. I looked into the supported alternatives for access control a few months ago and found all of them to be either ridiculously complicated in their business terms, obscenely expensive or both. I judged that using any of them to assist in the collection of money from my users would cost me many times what they actually pay. NDS allows me to introduce myself every few days and gives users a paypal address and my price. Enough people contribute that I'm comfortable leaving it that way.
foxtroop11
Service Provider
Joined: 22 Mar 2009
Posts: 1168
Location: Ansbach, Germany and sometimes the States

Posted: Thu Dec 31, 2009 1:27 am

Sounds good, hope you can come up with something. I hate to keep bringing it up, and I'm sure people are getting tired of hearing it, but when I'm getting 20+ Mb across the mesh a little script like this doing a check isn't even noticed, nor do I even need batmand, as olsr is holding up strong on dual radio. Improvements are always good though. As far as NoDogSplash goes, something just as easy is right there on the dash. I intended to test it tonight but got busy. Just try the new Coova branch and the manual/legacy option on the Open-Mesh dash to pick the coova.org setting. It looks to operate just like NoDog without all the hassle and reliance on older kernels. I've been saying this for a while, but this will pave the way for new and exciting stuff. If someone comes along and fixes NDS, great, still very useful.


Take a look at omextreme on the dash. Both are turned off now, but I think looking at the numbers you'll get the idea.
phred
Moderator
Joined: 01 Jul 2008
Posts: 207
Location: San Francisco

Posted: Thu Dec 31, 2009 2:47 am

duncanSF wrote:

/lib/robin/last-hop.sh contains a sourceable shell script function getLastHop(). There is no comment documentation giving the arguments the function expects, the format of any value it could return, a list of shell variables the function might read, set or clobber or a list of files used or modified by the function.


You're right, there isn't much documentation in there, but I added several comments in last-hop.sh when I refactored it a couple months ago so that it returned in under 5 seconds for > 100 nodes. Right now the file is about 25% comments:

Code:

root@host:~# cat /lib/robin/last-hop.sh  | grep '#' | wc -l
16
root@host:~# cat /lib/robin/last-hop.sh   | wc -l         
63


duncanSF wrote:

It deletes /tmp/topology_table at one point and replaces it several seconds or minutes later, if it gets that far. Does anything else expect /tmp/topology_table to exist?


Yes, but I can't remember where right now. grep is your friend though :)

duncanSF wrote:

At the end of the day getLastHop()'s purpose is to give speed-test a target from which to pull a megabyte of scratch data with wget, as a local speed test. In an environment where a 1 Mbps signal path can be chosen and may even be best, the speed test busies the single channel used by all nearby nodes for more than ten seconds, which is disruptive to the end user who may be playing an interactive game or trying to hold a telephone conversation over the link. Repeat that every five minutes and for every node on the wrong side of a weak link and the instrumentation sinks the ship.


I had these same concerns initially, but performance testing assured me that the problem is not as bad as you are alluding to here. The traceroute does generate UDP traffic, but not enough to disturb OLSR's routing tables unless you have an especially bad link. The changes I committed recently for 2768 remove IPv6 queries from wget, which were causing some issues with OLSR's routing measurements, but the current version (especially since I added olsr-0.5.6-r8-pre) is very performant and stable.

I agree the traceroute isn't the best solution there, but it is 'good enough' for non-enterprise use right now. I thought that it might be possible to gauge the current route from the OLSR table, but it is difficult to do with shell scripts. I had a hard time doing it using Perl, Python, and Ruby :)

duncanSF wrote:


image: sporadic checkins for nodes exhibiting this problem.
network name: waves2


Something is very wrong with your links in that image, but it looks like your current status is ok. I didn't know you had a setup in Alameda, good to know next time I take the ferry!

duncanSF wrote:

I do hope NoDogSplash isn't dropped anytime soon


That's unlikely, as it is currently the most stable and also the fastest captive portal around. I pushed a couple of changes recently that resolve issues which were occurring when a lot of users connected at once. The author, Paul Kube, is a great guy and is still active in developing it; I've sent him several patches we contributed to ROBIN.

_________________
Advertising and Commercial Grade Solutions for Open-Mesh Networks
Silver Lining Networks is a development contributor to the ROBIN firmware
foxtroop11
Service Provider
Joined: 22 Mar 2009
Posts: 1168
Location: Ansbach, Germany and sometimes the States

Posted: Fri Jan 01, 2010 1:43 pm

Just looking at a couple networks, one I setup and the other belongs to another user.


I'm not able to go on site to really see what's going on, but both networks have something strange going on like yours. All the nodes have good LOS to each other with strong signal, yet some show on the Open-Mesh dash as intermittent operation. I don't know if I should believe what the dash says or not.

One feature, which should probably go under the wish list, is the ability for the node to report to the dash the last reboot reason. There could be a little space next to each node, and if it has a number next to it you could determine if it really had rebooted and what the reason was.
foxtroop11
Service Provider
Joined: 22 Mar 2009
Posts: 1168
Location: Ansbach, Germany and sometimes the States

Posted: Fri Jan 01, 2010 6:52 pm

After watching a network in question on and off for a couple of hours, it looks like you're onto something here.

I'm looking at two nodes with good LOS and signal between one another. By looking at the dash alone it appears that the repeater is rebooting every 7-12 minutes. It will check in, then it will say uptime roughly 9 minutes or so. I'll then wait: sometimes the next checkin occurs on time, sometimes it occurs a little late, and most of the time it occurs really late. I'll then look over at the uptime and it will be back to around 9 minutes or so. The bar graph on the right of the Open-Mesh dash looks almost all filled in with green, with just a few sporadic outages. I wish I could see what exactly is going on, but now that you have brought this issue to light I'm thinking it might be happening to me as well.

Until I can gain access to this repeater node it's all a guessing game. I just find it hard to believe it's rebooting over and over again. I'm wondering whether a failure of this script and the checkin process would somehow reset the uptime, or does that only occur upon a reboot?
ispyisail
Site Admin
Joined: 12 Sep 2008
Posts: 4604
Location: New Zealand

Posted: Fri Jan 01, 2010 7:01 pm

Quote:
One feature, which should probably go under the wish list, is the ability for the node to report to the dash the last reboot reason. There could be a little space next to each node, and if it has a number next to it you could determine if it really had rebooted and what the reason was.



Already done

foxtroop11
Service Provider
Joined: 22 Mar 2009
Posts: 1168
Location: Ansbach, Germany and sometimes the States

Posted: Fri Jan 01, 2010 7:23 pm

lol, can't believe I didn't notice that. Thanks for pointing that out! The node shows -52, which leads me to believe it's actually not rebooting. I need to look at these scripts in question. I'm guessing that if the node does not check in successfully then the uptime is reset, but I'm not sure. Either way the node does not actually look like it's rebooting every 7-10 minutes, as -52 would make no sense since it's a repeater and not a gateway, unless something is plugged into its WAN port confusing it?
bconverse
Moderator
Joined: 07 Mar 2008
Posts: 848
Location: Little Rock, AR (USA)

Posted: Sat Jan 02, 2010 2:26 am

I have seen repeaters *think* they are a gateway sometimes. Also, I have seen a repeater show another repeater as its gateway.
duncanSF
User
Joined: 13 Mar 2009
Posts: 49
Location: California

Posted: Mon Jan 04, 2010 12:40 am

I'm stalled on testing this: whatever makes traceroute fail to return when a probe brings no reply has to have some trigger but I haven't found it yet, other than by mistake.

The reboot reason is reported to the dashboard and displayed as a negative number below uptime, in a column whose heading doesn't mention it. Good to know. Thank you.

From Super 9 I see a checkin graph that looks like mine only on node "Super 9 Motel 3". That one reports "-22" where the others report "-91" (most of them) and "-35".

Table lookup:
22: low memory
35: captive portal failure
52: restore gateway role
91: reboot needed after update

Low memory. Same as mine. Shell scripts pile up waiting for the traceroute-awk-grep pipe and eventually the available memory drops below a setpoint.
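
For anyone scripting against those numbers, the lookup above can be sketched as a small shell function. Only the four codes listed in this post are covered, and reboot_reason is a hypothetical name.

```shell
# reboot_reason CODE   (hypothetical helper)
# Translates a dashboard reboot code to text. The dashboard shows the code
# as a negative number, so a leading "-" is stripped before matching.
reboot_reason() {
    case "${1#-}" in
        22) echo "low memory" ;;
        35) echo "captive portal failure" ;;
        52) echo "restore gateway role" ;;
        91) echo "reboot needed after update" ;;
        *)  echo "unknown code: $1"; return 1 ;;
    esac
}
```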

foxtroop11: a node that has failed to check in is reported with whatever uptime it last volunteered. I shut one off ten days ago and that node is being reported with an uptime of five days.

The physical configuration that coincided with the traceroute failure was to have an auto-negotiating d-link fast ethernet switch connected to the network port of a repeater with a Windows Vista laptop and an SNOM IP phone plugged into the switch.

The phone doesn't work connected that way, though it does work when plugged into the laptop through the switch with the laptop offering DHCP with ICS using a wireless link to the same OM1P as its Internet connection. The nodes can handle the phone's traffic but seem to flub the DHCP exchange. More on that in another thread after more investigation.

And yes, phred, I know the reported download-from-gateway speeds are low in my setup. I'm refraining from being bothered by that until I have a homogeneous set of nodes in a path back to the upstream router outside the mesh. That will happen when I have a dry day (both sites are exposed) and time to play. For now the route being chosen is often the worst available.

My real users have been choosing a node currently named '…' just one hop from its gateway. Last and average speeds are 5.9 and 4.5 Mbps, both several times higher than my DSL provider actually delivers.
foxtroop11
Service Provider
Joined: 22 Mar 2009
Posts: 1168
Location: Ansbach, Germany and sometimes the States

Posted: Mon Jan 04, 2010 7:33 am

Quote:
From Super 9 I see a checkin graph that looks like mine only on node "Super 9 Motel 3". That one reports "-22" where the others report "-91" (most of them) and "-35".


You're right, I thought -22 was something about wifi; that's 23. It would be nice if I could confirm it's doing the same thing as what you're talking about. About uptime: so regardless of whether the node fails to check in, say, once or twice, the next time it does the uptime should reflect the overall time up? I'm watching this one node on another network that appears to be rebooting every 10 mins or so non-stop, at least that's what it looks like when looking at the dash.
Antonio (isleman)
Site Admin
Joined: 10 Feb 2008
Posts: 2323
Location: Toscana, Italy

Posted: Mon Jan 04, 2010 1:01 pm

hmm... I'm again working on last-hop.sh, and a first change might be to use the "link" table instead of the "topology" table:
from
Code:

echo '/topo' | nc 127.0.0.1 8090 |awk 'FNR>5' | grep $MYIP | grep $NEXT_HOP > /tmp/topology_table
while read TOPOLOGY_TABLE_RECORD ; do
   if [ "$(echo $TOPOLOGY_TABLE_RECORD |awk '{print $1}')" == "$MYIP" ] ; then
      LQ=$(echo $TOPOLOGY_TABLE_RECORD | awk '{ print $3 }')
      NLQ=$(echo $TOPOLOGY_TABLE_RECORD | awk '{ print $4 }')
   fi
done < /tmp/topology_table

to
Code:
     
LINK=$(echo '/link' |nc 127.0.0.1 8090 |awk 'FNR>5' |grep $NEXT_HOP)
LQ=$(echo $LINK |awk '{ print $3 }')
NLQ=$(echo $LINK |awk '{ print $4 }')   
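
One caveat with the grep above is that it matches substrings and treats dots as wildcards, so 5.1.2.3 would also match a row for 5.1.2.30. A sketch that matches the next hop as a whole field is shown below. It assumes, as in the snippet above, that LQ and NLQ sit in columns 3 and 4; link_quality is a hypothetical name, and the exact column layout may differ between olsrd versions.

```shell
# link_quality NEXT_HOP   (hypothetical helper)
# Reads link-table text on stdin and prints "LQ NLQ" (columns 3 and 4) for
# the first row that contains NEXT_HOP as a complete field.
link_quality() {
    awk -v hop="$1" '
        {
            for (i = 1; i <= NF; i++) {
                if ($i == hop) { print $3, $4; found = 1; exit }
            }
        }
        END { exit (found ? 0 : 1) }
    '
}
```

It could be fed the same way as above, for example echo '/link' | nc 127.0.0.1 8090 | link_quality "$NEXT_HOP", with the two values then split by the caller.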
 
Antonio (isleman)
Site Admin
Joined: 10 Feb 2008
Posts: 2323
Location: Toscana, Italy

Posted: Tue Jan 26, 2010 3:41 pm

@duncanSF

this is my idea:

- assume that the nodes store their default route IP into the file /www/next_hop (via an N-seconds scheduled job)

- the code to get the last hop IP could be the following:

Code:
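# Follow the chain of default routes, asking each node for its next hop,
# until the next hop is outside the mesh (here assumed to be 5.0.0.0/8).
hop_count=0
max_hops=5
NODE_IP=""
LAST_HOP=""
NEXT_HOP=$(ip route show | grep -i 'default' | awk '{print $3}')
[ -z "$NEXT_HOP" ] && exit 1

while [ "$hop_count" -le "$max_hops" ] ; do
   # A next hop outside the mesh means the node we just asked is the last hop.
   echo "$NEXT_HOP" | grep -q '^5\.' || { LAST_HOP=$NODE_IP; break; }
   NODE_IP=$NEXT_HOP
   # On failure NEXT_HOP is left unchanged, so the same node is asked again.
   wget -q -O /var/next_hop "http://${NEXT_HOP}:8080/next_hop" && NEXT_HOP=$(cat /var/next_hop)
   hop_count=$((hop_count + 1))
done

echo "$LAST_HOP"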


ie, one node follows the "chain" asking for next hop at every hop...
It works for me, and this way we could eliminate the need of traceroute.
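
The publishing side assumed in the first point could be sketched like this. The function names are hypothetical, /www/next_hop is the path proposed above, and how often the job runs is left open.

```shell
# default_gateway   (hypothetical helper)
# Reads `ip route show` output on stdin and prints the gateway of the first
# default route, or returns 1 if there is none.
default_gateway() {
    awk '$1 == "default" && $2 == "via" { print $3; found = 1; exit }
         END { exit (found ? 0 : 1) }'
}

# publish_next_hop FILE   (hypothetical helper)
# Writes the current default gateway to FILE by way of a temporary file,
# so the web server never serves a partial answer.
publish_next_hop() {
    gw=$(ip route show | default_gateway) || return 1
    echo "$gw" > "$1.tmp" && mv -f "$1.tmp" "$1"
}
```

The scheduled job would then only need to call publish_next_hop /www/next_hop.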

thoughts?
All times are GMT + 1 Hour