Home lab - arguments for a 'production' environment

As I've used my lab over the last year or so, I've come to a pretty strong conviction: if you are simply running 'test' VMs in a throwaway environment, you will lose out on some key 'free' production experience. That's a lot of quotes, so let me try to explain.

Suppose we have a test lab with a few ESXi hosts, vCenter, and whatever flavour of OS environment you like. You have an IIS/Apache test site, a SQL/MySQL cluster, an AD domain, VCB, that sort of stuff. All well and good. One day something in the ESXi/vCenter infrastructure breaks and you get frustrated trying to solve the problem. No big deal, wipe and replace, right? Restore from backup without a second thought.

Now suppose we have the same test lab, but you are running your own website with a SQL back-end, monitoring software for both your internal lab environment and a few of your clients' sites, an Exchange box that receives your mail, an AD domain that the home PCs authenticate against, etc. One day something in the infrastructure breaks. You can't just wipe and replace! You have to figure it out, and fast!

It's the latter scenario that will help hone your 'under pressure' troubleshooting skills without getting you fired (you were wise enough to keep your wife's things somewhere safe, right?), and it will change your outlook on troubleshooting in the workplace. As others have noted, admins these days find it too easy to just say 'blow it away and start fresh'. Figuring out the actual root cause isn't easy, but sometimes you must, for legal or financial reasons, so the faster you can do it, the better. This varies between environments, but larger companies may be upset when you can't provide a solid reason why something broke.

All this to say: when you come home from work and one of your hosts is acting up or your cluster is broken, it can be a stumbling block, but you'll usually come out ahead with a few more tricks under your belt, and a little experience goes a long way.

P.S. If you were wondering, my lab is currently having issues, and I run things like the PTC wiki off it, so I consider taking it down to be 'downtime' that should be avoided.

