Site down, Thursday 11:43 am ET

Andy Registered Users Posts: 50,016 Major grins
edited January 18, 2007 in SmugMug Support
Our engineers are aware, and working on it. Will post updates here. Sorry for the hassle.

UPDATE: And at 7:25 pm ET we are down again. Onethumb is on the case... stay tuned.

Comments

  • nedens Registered Users Posts: 25 Big grins
    edited January 18, 2007
    smugmug down again?
    I keep getting a "Http/1.1 Service Unavailable" page.

    I understand that things do happen, but since I signed up we have had 3 different periods of downtime.

    It makes using smugmug for business very, very frustrating.


  • jfriend Registered Users Posts: 8,097 Major grins
    edited January 18, 2007
    Andy, I presume you mean Thurs.
    --John
  • Andy Registered Users Posts: 50,016 Major grins
    edited January 18, 2007
    nedens wrote:
    I keep getting a "Http/1.1 Service Unavailable" page.

    I understand that things do happen, but since I signed up we have had 3 different periods of downtime.

    It makes using smugmug for business very, very frustrating.


    Yep and we know it. We're really sorry. I'm guessing it's related to the previous issue, and I'm sure that onethumb and wireless will have it up and going in no time.


    http://www.dgrin.com/showpost.php?p=462722&postcount=52
  • Andy Registered Users Posts: 50,016 Major grins
    edited January 18, 2007
    jfriend wrote:
    Andy, I presume you mean Thurs.
    Fixed the thread title. Sorry, and thanks!
  • nedens Registered Users Posts: 25 Big grins
    edited January 18, 2007
    Andy wrote:
    Yep and we know it. We're really sorry. I'm guessing it's related to the previous issue, and I'm sure that onethumb and wireless will have it up and going in no time.


    http://www.dgrin.com/showpost.php?p=462722&postcount=52

    Don't mean to sound ungrateful, as I do appreciate you all catching it very quickly and your punctuality in notifying people. It really sounds like you all don't normally have these kinds of problems. But with me just signing up, it's just frustrating. I hope you all can get things stabilized soon.
    Thanks again for reacting so quickly.
  • onethumb Administrators Posts: 1,269 Major grins
    edited January 18, 2007
    Andy wrote:
    Our engineers are aware, and working on it. Will post updates here. Sorry for the hassle.

    Eureka!

    I know this is difficult for everyone to understand, but we actually *do* need to keep letting the site crash like this.

    Why? Because we can't fix the problem without knowing what causes it. So every time it goes down, we change one thing we think might help, and wait to see if it made a difference.

    This time, finally, we hit the jackpot. We know definitively what piece of hardware is failing, and we have spares standing by. It is a core piece of hardware, so fixing it permanently will result in the site being down for a while.

    We're getting the site back up right now, and hope that it will limp along until maintenance tonight, but we're starting to prepare just in case it can't.

    More as I get it.

    Don
  • Andy Registered Users Posts: 50,016 Major grins
    edited January 18, 2007
    nedens wrote:
    Don't mean to sound ungrateful, as I do appreciate you all catching it very quickly and your punctuality in notifying people. It really sounds like you all don't normally have these kinds of problems. But with me just signing up, it's just frustrating. I hope you all can get things stabilized soon.
    Thanks again for reacting so quickly.
    We don't think you are ungrateful at all :D You should be a bit concerned, I'd be too. It's natural. Here's a recent posting by our CEO and Chief Geek:
    http://www.dgrin.com/showthread.php?p=462786#post462786
  • johng Registered Users Posts: 1,658 Major grins
    edited January 18, 2007
    onethumb wrote:
    Eureka!

    It is a core piece of hardware, so fixing it permanently will result in the site being down for a while.


    Don

    Can you please expand upon this statement? What is the estimated window you will need for repair? As an IT professional I understand things happen. And I sympathize with it taking time to identify the problem. But repair work should have an estimated duration. What is that duration? I'm in the middle of posting results from a shoot this past weekend and I'm getting complaints from customers that the site is not available. It's not those people that concern me though - it's the people from the event that DON'T say anything and just don't come back. So if you could provide more information about the recovery window, I can pass it along to my customers.

    Now that the problem is identified, a generic "we're working on it and doing the best we can" isn't enough for business clients of mine. Just like the IT department in my company must provide our business clients with estimates, I expect the same from Smugmug, my business partner. I realize you're doing the best you can to identify and fix things. But you also have to help us plan and manage our clients.

    Thanks,

    John
  • onethumb Administrators Posts: 1,269 Major grins
    edited January 18, 2007
    johng wrote:
    Can you please expand upon this statement? What is the estimated window you will need for repair? As an IT professional I understand things happen. And I sympathize with it taking time to identify the problem. But repair work should have an estimated duration. What is that duration? I'm in the middle of posting results from a shoot this past weekend and I'm getting complaints from customers that the site is not available. It's not those people that concern me though - it's the people from the event that DON'T say anything and just don't come back. So if you could provide more information about the recovery window, I can pass it along to my customers.

    Now that the problem is identified, a generic "we're working on it and doing the best we can" isn't enough for business clients of mine. Just like the IT department in my company must provide our business clients with estimates, I expect the same from Smugmug, my business partner. I realize you're doing the best you can to identify and fix things. But you also have to help us plan and manage our clients.

    Thanks,

    John

    As an IT professional, you probably realize that any estimate I give you is simply a best guess, right? :)

    I've worked with some of the largest IT organizations on the planet, and have friends working at them to this day, and they rarely meet their estimates. It's sorta like software development that way - Murphy's law strikes almost every time.

    In all honesty, I think and hope the problem can be fixed in less than 30 minutes if everything goes perfectly, but let's double that to an hour just to be on the safe side. That way, when it takes two hours, I'll only be off by 50%. :D

    FYI, we've narrowed it down to one of three things at this point: either an optical fibre channel cable (unlikely, since the error rate is so low), a bad disk inside of a RAID array (unlikely, since we're not getting any errors from the controller), or a bad RAID controller. We think it's the last of these, and that piece of hardware isn't hot-swappable and doesn't have a hot-standby. It is a tool-less swap, though, so theoretically it should be very fast - but we'll just have to see.

    After it's been replaced, we then need to bring the data back online and do an integrity check. This will theoretically take the bulk of the time (15-30 minutes) since swapping the card is so easy and we have a relatively large amount of recently touched data to verify. So much of the repair downtime will just be waiting for data to spool off disk.

    The site is back up, and the error rate is relatively low for something that we're pushing many GBs through, so I'm hoping we can last until 10pm Pacific tonight without another crash, but should we crash again, we'll start implementing this repair process immediately.

    More as I get it.

    Don
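
The estimate arithmetic in the post above can be sketched in a few lines. This is only an illustration and not anything SmugMug ran: the function name `estimate_error` is hypothetical, and measuring the miss as a fraction of the actual duration is an assumption chosen to match the "off by 50%" remark.

```python
def estimate_error(estimated_minutes: float, actual_minutes: float) -> float:
    """Return how far an estimate fell short, as a fraction of the actual duration."""
    return (actual_minutes - estimated_minutes) / actual_minutes


best_case = 30           # minutes, "if everything goes perfectly"
padded = best_case * 2   # doubled "just to be on the safe side"
actual = 120             # the hypothetical two-hour repair from the joke

print(padded)                          # 60
print(estimate_error(padded, actual))  # 0.5
```

Measured against the padded one-hour estimate instead, the same two-hour repair would be a 100% overrun, so the figure depends on which duration is taken as the baseline.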
  • johng Registered Users Posts: 1,658 Major grins
    edited January 18, 2007
    Thanks for sharing. Yep, I understand about estimates. But, to summarize your reply, it sounds like you are implementing the fix now and the estimated time to recovery is no more than 2 hours. So if I tell my clients the site will be available and working by 5 pm, that should work.

    Thanks for the quick reply - it will really help me out.

    EDIT - looks like I read too fast and you're not planning on fixing it right away. When do you plan on taking the system down for the planned fix if it doesn't crash again?
  • onethumb Administrators Posts: 1,269 Major grins
    edited January 18, 2007
    johng wrote:
    Thanks for sharing. Yep, I understand about estimates. But, to summarize your reply, it sounds like you are implementing the fix now and the estimated time to recovery is no more than 2 hours. So if I tell my clients the site will be available and working by 5 pm, that should work.

    Thanks for the quick reply - it will really help me out.

    EDIT - looks like I read too fast and you're not planning on fixing it right away. When do you plan on taking the system down for the planned fix if it doesn't crash again?

    Our weekly scheduled maintenance window begins at 10pm Pacific tonight, so hopefully we'll last until then, but if we don't, we'll do it immediately upon failure.

    Don
  • Andy Registered Users Posts: 50,016 Major grins
    edited January 18, 2007
    And at 7:25 pm we are down again. Onethumb is on the case... stay tuned.
  • FAU4U Registered Users Posts: 29 Big grins
    edited January 18, 2007
    Site is up again 9pm.
    Site is up again at 9pm. Thanks
    Andy wrote:
    And at 7:25 pm we are down again. Onethumb is on the case... stay tuned.