I guess everything nowadays starts with “hello world”, so here goes ours. We’ve set out to write a blog to share the knowledge and experience we’ve gathered over the years and, although it’s a bit of a cliché, to give back to the IT community, as it’s through all the blogs and forums out there that we got to where we are today.
That being said, this is the first in a series of IT-related articles which we’ll strive to publish and update periodically, so here goes nothing. Stay tuned!
The One with the BSOD
You probably remember the ’90s sitcom Friends; what you might not recall is that all the episode names started with “The one with…”. Well, to make it a bit easier to digest, and because I kind of recall all this stuff as episodes from an IT engineer’s day-to-day work, all the articles I’ll be publishing will contain this phrase.
The one with the BSOD… This was a bit of an interesting one.
A few months ago I looked into a problem with a virtual machine not booting up. Well, it was booting up fine, but right before the Windows welcome screen it was BSODing with this screen.
The VM was a Windows Server 2008 R2 machine running on VMware and sitting on shared storage (not gonna go into all the details here). Right before it started to misbehave, it had been powered down for a RAM increase. Once this was done and the machine was powered back on, it started to BSOD. I’m usually happy when a computer BSODs rather than simply restarting, because there’s a dump file you can debug.
Sudden restarts usually suggest problems at a lower level, like hardware. However, as always, there’s a catch. If your system BSODs while running, then that’s cool: take the dump, analyze it and take action. You have access to the system… you can deploy a fix easily. This one was crashing before I could reach any type of interface. Normal, safe mode, command line, last known good configuration… all crashing.
Where to start?
As expected, there was a dump file on our server (found by booting with a live CD). No luck… because it was a kernel dump, it did not provide enough data to figure out where the issue was. Sometimes you need a full memory dump to catch everything (user data included). You most likely know how to set a Windows OS to generate a full memory dump via the GUI itself… but how do you do it without GUI or registry access?
I did it by booting the VM with a Windows live CD and mounting the SYSTEM hive from C:\Windows\System32\config\. I then edited the CrashDumpEnabled value as per http://support.microsoft.com/kb/969028 / http://support.microsoft.com/kb/254649 and powered the VM up.
But… needless to say… even though the VM continued to BSOD and the screen said it was writing data to disk… no dump file was being created. Sad face, dead end.
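For anyone who prefers the command line over regedit’s Load Hive menu, here is a rough Python sketch that only builds the reg.exe command lines for the same change. It is not what I ran back then; the mount name OFFLINE_SYSTEM and the helper name are made up for illustration, and ControlSet001 is assumed to be the control set you want to edit (an offline hive has no CurrentControlSet). The CrashDumpEnabled numbers are the ones from the KB articles above.

```python
# CrashDumpEnabled values as documented by Microsoft.
CRASH_DUMP_TYPES = {
    "none": 0,
    "complete": 1,  # full memory dump
    "kernel": 2,
    "small": 3,     # minidump
}


def offline_dump_commands(hive_path, control_set="ControlSet001",
                          dump_type="complete", mount_name="OFFLINE_SYSTEM"):
    """Build reg.exe command lines that set CrashDumpEnabled in an offline hive.

    mount_name is a hypothetical temporary key name; nothing is executed here,
    the commands are only returned as strings.
    """
    value = CRASH_DUMP_TYPES[dump_type]
    key = "\\".join(["HKLM", mount_name, control_set, "Control", "CrashControl"])
    return [
        f'reg load HKLM\\{mount_name} "{hive_path}"',
        f'reg add "{key}" /v CrashDumpEnabled /t REG_DWORD /d {value} /f',
        f"reg unload HKLM\\{mount_name}",
    ]


if __name__ == "__main__":
    # Assumed path: the broken VM's system disk as seen from the live CD.
    for line in offline_dump_commands(r"C:\Windows\System32\config\SYSTEM"):
        print(line)
```

You would paste the printed lines into a command prompt in the live environment, adjusting the drive letter to wherever the broken system disk shows up.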
So where’s the issue?
As I was implementing this registry change, I could not ignore the fact that CurrentControlSet was missing and all I had was ControlSet001 and ControlSet003. So, like every IT guy out there who doesn’t know something but wants to, I googled it, and I found this: http://support.microsoft.com/kb/100010.
OK, now everything clicks. (For those who don’t want to read the MS KB: CurrentControlSet exists only when Windows is running. It is nothing more than one of the ControlSet00x keys, typically ControlSet001, mapped under that name. Another ControlSet00x holds your last known good configuration.)
I went back to the registry and started to look around. I started wondering how you’d get Windows to use a certain control set, which is how I ended up looking at the Select key.
The Select key (HKEY_LOCAL_MACHINE\SYSTEM\Select) controls which control set the system uses for normal booting and for last known good configuration booting.
The Default value had the same number as the LastKnownGood value. This means that each time I tried to boot into the last known good configuration, I was actually booting into the default, current one. I made the change and pointed the LastKnownGood value to the backup control set. Restarted the VM and… it booted just fine.
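To make the Select logic a bit more concrete, here is a small Python sketch of how those numbers map to control set names and how you could spot the situation I ran into. The function names and the sample numbers are purely illustrative (I did not write down the real ones); the only thing taken from the registry itself is that Select holds the Current, Default, Failed and LastKnownGood values.

```python
def control_set_name(index):
    """Map a Select value such as 1 to its key name, ControlSet001."""
    return f"ControlSet{index:03d}"


def describe_select(select):
    """Summarise a Select key given as a dict of its DWORD values.

    Expects the 'Default' and 'LastKnownGood' entries; other entries
    ('Current', 'Failed') are ignored in this sketch.
    """
    default = control_set_name(select["Default"])
    last_known_good = control_set_name(select["LastKnownGood"])
    return {
        "default": default,
        "last_known_good": last_known_good,
        # False means "last known good" boots the very same control set.
        "lkg_is_distinct": default != last_known_good,
    }


if __name__ == "__main__":
    # Illustrative numbers only: both values point at the same control set.
    broken = {"Current": 1, "Default": 1, "Failed": 0, "LastKnownGood": 1}
    print(describe_select(broken))
    # After pointing LastKnownGood at the backup control set.
    fixed = dict(broken, LastKnownGood=3)
    print(describe_select(fixed))
```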
Argh… there’s something in the registry that is causing this… but what? I went back to the registry and devised a simple trial-and-error plan: export one subkey from the working control set, import it into the non-working one, power on the VM, and so on and so forth until I found the subkey that made the VM stable. In this case it was the Control subkey. Then I went one step further, exported every subkey of that key and repeated the test. Half a day later I ended up with the faulty key: “HKLM\SYSTEM\ControlSet001\Control\Session Manager\Environment”. Right… which value is it then? I ran the same tests as above, excluding them one by one, and… PATH was the one.
The PATH value causing the problem had some extra entries at the beginning. After removing those entries (in red below) and leaving the defaults (in regular black), the server booted just fine.
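If you squint, that half a day of exporting and importing is just a level-by-level search down the registry tree. Below is a Python sketch of the idea, with the actual "import the key and power on the VM" step replaced by a hypothetical callback, fixes_boot, since that part was all manual for me. The tree and the key names in the example are a cut-down illustration, not a dump of the real hive.

```python
def drill_down(tree, fixes_boot):
    """Walk a nested dict of key names, one level at a time.

    At each level, try the children in order and descend into the first one
    for which fixes_boot(path) is true. fixes_boot stands in for "import this
    key from the working control set and see whether the VM boots".
    Returns the path (list of names) to the deepest item that fixed the boot.
    """
    found = []
    children = tree
    while children:
        for name, subtree in children.items():
            candidate = found + [name]
            if fixes_boot(candidate):
                found = candidate
                children = subtree
                break
        else:
            # No child at this level made a difference; stop here.
            break
    return found


if __name__ == "__main__":
    # Illustrative, heavily trimmed control set layout.
    tree = {
        "Enum": {},
        "Control": {
            "CrashControl": {},
            "Session Manager": {
                "Memory Management": {},
                "Environment": {"TEMP": {}, "PATH": {}},
            },
        },
        "Services": {},
    }
    culprit = ["Control", "Session Manager", "Environment", "PATH"]
    # Simulated test: an import helps only if it lies on the culprit's path.
    print(drill_down(tree, lambda path: culprit[:len(path)] == path))
```

Each call to fixes_boot corresponds to one reboot of the VM, which is why this took me half a day rather than a few seconds.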
I don’t know if the length of the path was causing this or something else, but it sure was an interesting catch.
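If you ever want to check a PATH value for the same kind of surprise, here is a small Python sketch that lists the entries not present in a known-good default and reports the total length. The helper names and the sample entry C:\ExampleApp\bin are made up; the default shown is only an example and should be replaced with whatever your own clean build has.

```python
def split_path(value):
    """Split a PATH-style string on ';' and drop empty entries."""
    return [entry for entry in value.split(";") if entry]


def extra_entries(current, default):
    """Return entries of current that do not appear in default.

    Comparison ignores case and a trailing backslash, which is how
    Windows paths are usually treated.
    """
    def normalise(entry):
        return entry.lower().rstrip("\\")

    known = {normalise(entry) for entry in split_path(default)}
    return [entry for entry in split_path(current) if normalise(entry) not in known]


if __name__ == "__main__":
    # Example default only; check it against a healthy server of the same build.
    default = r"%SystemRoot%\system32;%SystemRoot%;%SystemRoot%\System32\Wbem"
    # Hypothetical extra entry prepended by some installer.
    current = r"C:\ExampleApp\bin;" + default
    print("extra entries:", extra_entries(current, default))
    print("total length:", len(current))
```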