Some things I noticed while testing BRIstuff for * 1.4.31


During the last week I spent a lot of time debugging Asterisk 1.4.31 (BRIstuffed and vanilla). The locking “model” is really… well… let’s say “interesting”. Different threads lock mutexes in different orders at quite a few places, which would normally result in instant deadlocks. This is where the deadlock avoidance (lock.h) kicks in.

When * fails to acquire a lock (with “ast_mutex_trylock”) while it is holding another lock, it unlocks the held lock, sleeps for 1 microsecond and re-locks the lock it held before. This loops until it finally manages to lock the first lock:

while (ast_mutex_trylock(&lock1)) {
    ast_mutex_unlock(&lock2);
    usleep(1);
    ast_mutex_lock(&lock2);
}

Oh yes, sometimes it NEVER acquires the first lock! Which turns this “avoided deadlock” into a 100% CPU hog! And it makes it so much harder to actually find a race condition in the code…

In case you are wondering why your D channels sometimes go down for no reason (but might recover after some time), or why your Asterisk process is eating all your CPU cycles although it is pushing only very few calls, you might have hit an “avoided deadlock”. You will most likely experience degraded voice quality in this case, too.

If you happen to run Asterisk with realtime scheduling priority, then your userspace will most likely be gone! You can still ping your machine, but you cannot log in, either remotely or locally. No, your kernel did not crash, and neither your RAM nor your shiny quadBRI card is dodgy. ;-)

I managed to “fix” some places in chan_dahdi which used the wrong locking order, but there are still a few remaining places which will need much testing after the locking order has been resolved, for example:

If a DAHDI channel receives events from the ISDN, it will have the pri->lock acquired and will then try to acquire the pvt->lock and the channel->lock. On the other hand, if the Asterisk core calls a function (ast_answer, ast_hangup, …) on a DAHDI channel, it will call it with the channel->lock held and will try to lock the pvt->lock and then the pri->lock.

When both things happen at the same time we would have a deadlock (without the “deadlock avoidance”). With the “deadlock avoidance” there is a chance of creating an infinite loop. If the Asterisk system is not pushing many calls, the probability of such a loop increases significantly, because there are not many other threads “disturbing” the steady timing of the two threads (both always sleep for 1 microsecond and will most likely always be scheduled in the same order!). On a loaded Asterisk system the probability of such an event is much lower.

It might be a good idea to add a little randomness to the sleeping interval. For development and testing I will remove the “while-try-lock” loops and use the regular (deadlocking) ast_mutex_lock function instead.