The story of how a bad random number generator can result in 3-hour announcement list delays.
Some of you may use our Announcement List feature to send e-mail to a happy group of subscribers (happy because they all opted in)!
Some of you may have noticed that recently it was taking a long time to send out announcements: they’d generally go out two to three hours after the time you’d scheduled them for.
Some of you may have been getting angry about this..
Please be happy, it’s fixed now!
The first few times people reported this we thought it was just a temporary problem, like maybe there were just a lot of messages going out and the servers couldn’t keep up. We weren’t able to reproduce the problem, and generally if you can’t reproduce a problem, finding and fixing it is too hard to be worth the effort.
Finally yesterday, we were able to reproduce the problem.. right in front of our eyes the mailing lists were going out two hours behind schedule! Hooray! Well, actually BOO! that’s bad … but also, Hooray! now we can maybe fix it!
It turned out the root of the problem was not the sending of the mailing list itself, but another thing the same mailing list sender script did: sending out confirmation emails to addresses people had manually added to the list from our web panel.
But first let me give you a little “Announcement List History”…
It used to be that whenever somebody wanted to subscribe a bunch of people to their list from our panel, the panel would immediately attempt to send out the confirmation emails. This was fine when people were subscribing fewer than, say, 20 addresses... but if you tried subscribing hundreds of email addresses right there in real time, it would take so long that the panel would usually time out in the listmaster’s web browser!
This was no good because A. it looked bad, B. not all emails would always go out, and C. people would generally get scared and re-submit their confirmation list, thereby possibly “spamming” the very people we’re trying to make sure don’t get spammed!
Soon we implemented a better way. Instead of sending all the emails immediately, we’d just INSERT them into a database, and then when our script ran to send announcements it’d also send the confirmation emails based on that table!
And everything was great for a few months!
Then, strangely we started getting reports the panel was timing out AGAIN! Why, God, why?! Well, it turned out even INSERTing thousands of emails into the database was too slow for the panel (which is a bit strange, but I guess not unreasonable).
So, to fix it this time, we created a temporary table that would just immediately store the whole list of addresses in just one INSERT. Then later (during the sending) we’d break that list into its thousands of individual components and INSERT them into the main table. (The reason we need that table at all is to track the unique “goop” … something like 005lcw1grDw5jA … for when subscribers verify their email by clicking the link in the email they get).
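The two-step approach described above can be sketched roughly like this. It is only an illustration: the arrays stand in for the temporary and main tables, and the names `stage_submission` and `expand_staged` are made up for this example rather than taken from our actual scripts.

```perl
use strict;
use warnings;

# Hypothetical in-memory stand-ins for the two tables described above.
my @staging;    # one row per panel submission (the "temporary" table)
my @approve;    # one row per address (the main confirmation table)

# Panel side: store the whole submitted list as a single row (one "INSERT").
sub stage_submission {
    my ($list_id, @addresses) = @_;
    push @staging, { list_id => $list_id, blob => join("\n", @addresses) };
    return scalar @staging;    # number of submissions waiting to be expanded
}

# Sender side: expand each staged row into individual per-address rows.
sub expand_staged {
    while (my $row = shift @staging) {
        for my $addr (split /\n/, $row->{blob}) {
            push @approve, { list_id => $row->{list_id}, email => $addr };
        }
    }
    return scalar @approve;    # total per-address rows so far
}

stage_submission(7, 'a@example.com', 'b@example.com', 'c@example.com');
print expand_staged(), " addresses queued for confirmation\n";
```

The point of the split is that the panel only ever does the cheap step, while the slow per-address work happens later in the background script.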
Back to the present.
As it turns out, the sending of actual announcements has been getting held up by the thousands of INSERTs the script was doing to send confirmations to people being added from the web panel!
Well, the first thing we did was separate these two scripts.. there’s no reason announcements need to wait on confirmation emails! That’s just dumb. So that fixed it.. but why were these INSERTs taking so long?
After some poking around, it turned out the problem was actually that “goop” stuff! You see, we need them all to be unique, and so this is the code we were using:
## create the goop!
do {
    srand(time() ^ ($$ + ($$ << 15)) ); #gets a nice random seed.
    my $p = rand();
    my @chars = ('a'..'z', 'A'..'Z', '0'..'9');
    my ($salt) = $chars[rand($#chars)] . $chars[rand($#chars)];
    ($goop) = crypt($p, $salt);
} until ($db->Insert('mailinglist_approve',
Basically, we’d get some random goop, try and INSERT it into the table, and if that goop was already in there, it’d fail. Then we’d just create a new random goop and try again. Given the number of potential goops and the number of entries in the table at any given time, we should basically NEVER have to INSERT more than once. This is nice and good except for one part:
srand(time() ^ ($$ + ($$ << 15)) ); #gets a nice random seed.
my $p = rand();
It turns out that because the seed is based on the current time, we were not getting a "nice random seed" every time we ran it. The time only updated once a second, and so our goop would only change once a second! That meant we would do dozens of failing INSERTs over and over and over each second until the goop finally changed. And those dozens (hundreds?) of INSERTs were making the table slowww...
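To see the effect in isolation, here is a small simulation. It is a sketch under several assumptions: a hash stands in for the table with its unique goop column, the clock is faked so that it ticks once every 50 tries (a made-up rate), the process ID is a fixed made-up number, and the goop is built directly from `rand()` instead of going through `crypt()` as our real code did.

```perl
use strict;
use warnings;

my @chars = ('a'..'z', 'A'..'Z', '0'..'9');

# Build a 13-character token from the current state of Perl's rand().
sub make_goop {
    return join '', map { $chars[rand @chars] } 1 .. 13;
}

# Simulate confirming $count addresses against a table with a unique goop
# column (a hash stands in for the table). $reseed_each_try selects the
# buggy behaviour; $tries_per_second is a made-up rate for the fake clock.
sub simulate {
    my ($count, $reseed_each_try, $tries_per_second) = @_;
    my %table;
    my $tries    = 0;
    my $failed   = 0;
    my $pid_part = 4242 + (4242 << 15);    # fixed stand-in for $$ + ($$ << 15)

    srand(1_150_000_000 ^ $pid_part) unless $reseed_each_try;

    for my $n (1 .. $count) {
        while (1) {
            # The fake clock only advances once every $tries_per_second tries.
            my $now = 1_150_000_000 + int($tries / $tries_per_second);
            srand($now ^ $pid_part) if $reseed_each_try;
            my $goop = make_goop();
            $tries++;
            if (exists $table{$goop}) {    # duplicate key: the INSERT fails
                $failed++;
                next;
            }
            $table{$goop} = "user$n\@example.com";
            last;
        }
    }
    return $failed;
}

my $buggy = simulate(3, 1, 50);
my $fixed = simulate(3, 0, 50);
print "reseeding every try: $buggy failed INSERTs\n";
print "seeding once:        $fixed failed INSERTs\n";
```

When the generator is reseeded on every try, every attempt within the same fake second produces the same goop, so each address after the first has to wait for the clock to tick before its INSERT can succeed. Seeding once and then just calling `rand()` avoids that entirely.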
After a little bit of research on better random number generating techniques, we changed that code to be:
my $p = rand(`head -1 /dev/urandom`);
Which actually gives you a good random seed ALL the time. Immediately the number of INSERTs we were doing dropped dramatically and everything is now fast and happy!
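For the curious, here is one way the same idea could be written without shelling out to `head`. This is a sketch, not the code we run: it assumes a Unix-like system with `/dev/urandom`, and the helper names `urandom_bytes` and `make_goop` are invented for the example. Note that, per Perl's documentation, `rand(EXPR)` treats its argument as an upper bound and `srand(EXPR)` is the function that takes a seed, so the sketch skips seeding altogether and builds the goop straight from the random bytes.

```perl
use strict;
use warnings;

my @chars = ('a'..'z', 'A'..'Z', '0'..'9');

# Read $n raw bytes from the kernel's random device (Unix-like systems only).
sub urandom_bytes {
    my ($n) = @_;
    open(my $fh, '<:raw', '/dev/urandom')
        or die "cannot open /dev/urandom: $!";
    my $got = read($fh, my $buf, $n);
    close($fh);
    die "short read from /dev/urandom" unless defined $got && $got == $n;
    return $buf;
}

# Map each random byte onto the 62-character alphabet.
# The modulo introduces a slight bias, which is acceptable for a sketch.
sub make_goop {
    my ($len) = @_;
    return join '',
        map { $chars[ ord($_) % scalar(@chars) ] }
        split //, urandom_bytes($len);
}

print make_goop(14), "\n";
```

Because nothing here depends on the clock, two goops generated in the same second are no more likely to collide than any other two.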
And that's how a bad random number generator can result in 3-hour announcement list delays.