onethumb
Oct-08-2006, 11:32 PM
First of all, let me apologize for all of the problems yesterday and today with uploading (well, ok, so mostly it was with processing, not uploading, but to you, the customer, it's pretty much one and the same). We don't have any excuse, since this really shouldn't have happened, but I can at least fill you in on what's going on behind the scenes here.
Our upload queue typically hovers around 50 images pending, and we do parallel processing of images to the tune of dozens every second, on average, depending on the resolution, whether their colorspace needs to be converted to sRGB, etc. On a really heavy day, it may get as high as 1000.
Tonight it peaked at well over 60,000 images waiting to be processed.
We caught it "early", well below 10,000, around noon or so today. I sighed and moaned, then settled in to find whichever stupid server was lagging, determined to be done quickly so I could spend the day with my twins.
A few hours later, the queue was at 30,000 and I was no closer to finding the source of the problem. All servers looked good, things were processing, and I couldn't detect anything too major.
My first thought was that our new software releases on Thursday and Friday were the cause of the problem, especially because a major part of Thursday's release was an overhaul of our uploading backend to provide better error logging, processing, and reliability to our customers. Of course, the whole goal was to make things better, not worse, and our extensive internal testing had shown a dramatic improvement. Nonetheless, I spent quite a bit of time mucking around in the new release, benchmarking things and looking for obvious pitfalls. After a few hours, I came up empty - things seemed to be working as designed.
One of our master databases was under fairly heavy stress, but that's been on our radar for a while now. We have two new boxes nearly ready to go to take the load, and even loaded like it was, it shouldn't cause these sorts of issues. I spent a few hours logging, tuning, and adjusting code just in case. It helped unload the DB box some, but image processing was still bogging down. It helped get the queue down from 32,000 to about 22,000, which was nice, but certainly nowhere near acceptable.
Then BOOM! Someone or someones (I haven't gotten a chance to find out who yet) uploaded more than 30,000 new images in the space of a few minutes, and the queue went north again - up to 60,000! This happens periodically: someone at Google or Yahoo or Microsoft with a fast connection can shove stuff down the pipe in a hurry. Usually it only takes us a few minutes to handle the load and move on - but not tonight.
By now, it was after 10pm and I felt no closer to our goal. The Master DB box was basically unloaded by this time, which validated that it wasn't the root cause of the problem. A possible contributing factor, still, sure, but not the root.
And then it hit me. I was looking at the problem all wrong - we'd benchmarked the new code as best we could on our internal test servers, but we didn't have a load like this. More than 300,000 images had passed through our uploading queue today. It could be a teeny, tiny slowdown that, when multiplied by hundreds of thousands, turned out to be huge.
It was. Just like the proverbial hackers who steal just fractions of a cent out of everyone's bank accounts, but still manage to get rich, we were dealing with fractions of a second here. I made one tiny, stupid, silly mistake and it caused a tenth of a second or so of extra delay in processing. Do some quick math, and a tenth of a second per photo for 300,000 photos is more than 8 hours of wasted CPU time. Yikes!
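For anyone who wants to check the quick math, here is a minimal sketch. The two figures come straight from the post (roughly a tenth of a second of extra delay, about 300,000 photos); the function and constant names are made up for illustration.

```python
# Rough figures from the post; both are approximations.
EXTRA_DELAY_SECONDS = 0.1   # extra delay per photo
PHOTOS_PROCESSED = 300_000  # photos through the queue that day


def wasted_hours(delay_seconds: float, photo_count: int) -> float:
    """Total extra time in hours when every photo pays the same delay."""
    total_seconds = delay_seconds * photo_count
    return total_seconds / 3600  # 3600 seconds per hour


hours = wasted_hours(EXTRA_DELAY_SECONDS, PHOTOS_PROCESSED)
print(f"{hours:.2f} hours of extra processing time")
```

That works out to 30,000 seconds, or a little over 8 hours, which matches the "more than 8 hours" figure.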
What was it? It was the simplest thing. The worst ones always are. Instead of reading the newly uploaded Original from our local, fast in-house storage, I was accidentally reading it from our storage cloud at Amazon using S3 first. Worse, since it was a brand new upload, it hadn't been stored at Amazon yet. Basically, our servers were going all the way to Seattle, asking for a photo, being told it wasn't on Amazon yet, and then they finally turned around and asked the server two feet away here in Silicon Valley.
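The lookup-order mistake can be sketched roughly as follows. This is not the actual site code: the `Store` class, the `read_original` function, the key name, and the 1 ms local latency are all hypothetical, and the 0.1 s remote latency is just the "tenth of a second or so" mentioned above. The sketch only simulates the cost of asking the far-away store first for a file it doesn't have yet.

```python
class Store:
    """Hypothetical stand-in for one storage tier (local disk or a remote service)."""

    def __init__(self, name, latency_seconds, blobs=None):
        self.name = name
        self.latency_seconds = latency_seconds  # assumed cost of one lookup
        self.blobs = dict(blobs or {})
        self.lookups = 0

    def get(self, key):
        """Return the stored bytes, or None on a miss."""
        self.lookups += 1
        return self.blobs.get(key)


def read_original(key, stores):
    """Try each store in order; return (data, simulated seconds spent)."""
    spent = 0.0
    for store in stores:
        spent += store.latency_seconds  # every lookup costs time, hit or miss
        data = store.get(key)
        if data is not None:
            return data, spent
    raise KeyError(key)


# A brand new upload exists locally but has not been copied to the remote store.
local = Store("local", 0.001, {"photo-1": b"original bytes"})
remote = Store("remote", 0.1)

slow_data, slow_cost = read_original("photo-1", [remote, local])  # the bug
fast_data, fast_cost = read_original("photo-1", [local, remote])  # the fix

print(f"remote first: {slow_cost:.3f}s, local first: {fast_cost:.3f}s")
```

With the local store first, a new upload never touches the remote store at all; the remote tier is only consulted when the local copy is missing.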
So I believe it's fixed. We have a huge queue still (it was at 60,000 when I started writing this post, and it's now down to 40,000, so we're making fast progress), so I'm afraid you'll have to wait a little bit longer for all your photos to finish, but it looks like we're well on our way.
I'm not going to discount the possibility that I simply got lucky and everyone suddenly stopped uploading at the exact same instant I found my supposed fix, but it makes so much sense I'm hopeful. :) We'll find out for sure tomorrow. Back to the drawing board, if not - so keep those fingers crossed.
As a nice side effect, searching is now much, much faster than it was (go give it a whirl), and some other portions of the site got some optimizations too.
Thanks for being so patient. I know how frustrating it can be not to have something "just work." We truly do have the best customers in the world.
I promise, even if similar problems do crop up in the future, we'll do everything humanly possible to work on a fix and get things running smoothly again - weekends, holidays, whatever it takes.
Don