
A tale of migration and miscommunication

Before this last week, I considered myself reasonably competent and able when it came to handling myself in a professional manner. As the migration I've been working on wrapped up, I was sharply reminded that I have much to learn.

Planning and research can only take you so far - communication and experience are strong factors in successfully pulling off important (any) projects.

First of all, this is not a doom-and-gloom scenario; a lot went right. We're just going to focus on what went wrong, as that's where the good lessons are.
  1. Did not thoroughly check the documentation of the line-of-business (LOB) software to ensure all the pieces were there to migrate it successfully.
  2. Did not communicate with LOB developer/support (same person in this case) to ensure they would be available for emergency support during the migration.
  3. When an issue did arise, used email as the first line of communication.

So, let's deal with #1. The documentation was requested when the software was written, a year or so ago. When I initially went through it, I only noticed the client install, but figured the server part was at the bottom - it was just a database hosted on a SQL Server Express instance running on the server, so there wasn't much to it. Had I actually been looking for specific server-side installation instructions, I would have realized there was no guidance at all for that. That might have prompted me to try to set up a test instance in my lab - where I would have discovered the problem.

Next up, #2. Assuming I'd done #1 properly, I would have contacted the developer right away, and not dropped a bombshell in his lap. To be fair, it was a pretty simple problem with an easy resolution, but an emergency nonetheless. By not contacting him prior to the migration, I opened myself up to a potential show-stopper - what if he had been on vacation for three weeks? (I would have finagled the old server's SQL instance back to life, but that's not the point.)

Finally, I dropped the ball by assuming (again) that he checked his email regularly, with an unhealthy dose of avoiding speaking to people on my part. According to office staff, he did check it, so I went there first. My general instinct is to email first, as I've tended to get better responses from technical people via email than by phone. However, you can always get much more prompt attention when you're on the phone being a hassle. He responded right away on the following Tuesday morning when one of the office folks called him. We're still not sure why he did not respond to emails, but either way - email should never be the first line of communication when trying to resolve emergency situations.

Lessons learned:
  1. Take time to run through each step of the migration. Do not assume documentation will give you all the answers. For mission-critical applications, KNOW that they will work; don't guess.
  2. Notify any parties with a stake in the operation about the event in question. Not via email, but by calling them so they KNOW something is happening.
  3. In an emergency situation (should one arise) - USE THE PHONE. In an emergency, email should only be used to send helpful information, not to convey the initial problem.

There is a final lesson in all this. When estimating your time for anything with 'migration' in the title, double your estimate. If it involves tools you've never used before, quadruple it to cover learning time. Even if you end up far below your estimate, your client will appreciate it. Further, do not attempt to rush a migration - you will end up sorry you tried to beat the clock for convenience's sake. It's always wiser to just back away until you have enough time.
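
For anyone who likes their rules of thumb written down, here is a minimal sketch of that padding rule in Python. The function name `padded_estimate` and its parameters are made up for illustration, and it assumes that quadrupling replaces the doubling rather than stacking on top of it.

```python
def padded_estimate(raw_hours: float, unfamiliar_tools: bool = False) -> float:
    """Pad a raw migration estimate using the rule of thumb above.

    Hypothetical helper, not part of any tool or library.
    """
    if raw_hours < 0:
        raise ValueError("raw_hours must be non-negative")
    # Assumption: unfamiliar tools mean 4x the raw estimate in total,
    # not 4x applied on top of the usual 2x.
    multiplier = 4 if unfamiliar_tools else 2
    return raw_hours * multiplier


if __name__ == "__main__":
    # A migration you would naively call a 6-hour job.
    print(padded_estimate(6))                         # familiar tools
    print(padded_estimate(6, unfamiliar_tools=True))  # new tooling involved
```

The exact multipliers matter less than the habit of applying them before you quote a number to the client.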

Hope this helps someone avoid my mistakes.



If you're curious how the rest went, I'll be doing another post on the merits of migration helper kits.
