Thursday, November 12, 2009

Highlevel Availability -- continue

Taking precautions against unplanned downtime

Although one cannot prevent failures, one can take all reasonable precautions. The most obvious and most important precaution to take is to put in place a well-designed and implemented backup regime, taking into account the special requirements of the application. This is a subject beyond the scope of this article but, nonetheless, it is worth emphasizing its importance. Also…

Process failure precautions

• Control access to the server and server room
• Ensure that all of your team clearly understand their roles and responsibilities.
• Implement change controls to ensure that all software and hardware changes to a production server are documented. Because change control systems require sign-off by several team specialists, they give those specialists a chance to check for potential problems.
• Document and map all of your SQL Server instances, being particularly careful to record application relationships such as replication or log-shipping, data-feeds, message routes, links, remoting, and file transfer routes.
• Make sure your Test servers are identical in configuration to your Production servers.
• Before applying patches, hot fixes, and service packs, test them first on a Test Server.

Change failure precautions

• Document all proposed changes
• List the expected impact on the production system
• Gain consensus and sign off for the changes, as appropriate.
• Test the effect of the changes in terms of functionality and Stress/Load.
• Document the roll back/reversion plan and test it out on those who are likely to be 'early responders' to a system failure.

Natural, and man-made, disaster precautions

• Arrange for the service to be mirrored, or held at 'Warm stand-by', a long way away. Test out the ability to switch the service remotely. SAN replication is a popular solution, but mirroring is very effective.

Hardware failure precautions

• Simulate failure in all likely places to check that secondary hardware 'kicks-in' as expected.
• Make sure there is an architecture diagram, and clear instructions for all hardware recovery routines, which are easily understandable to the 'first responder'.
• Provide generous battery-backup.
• Use redundant power supplies
• Use hardware and software monitoring tools: hardware often gives out warning signs before 'letting go'.
• Use a RAIDed array or SAN for storing your data, with hot-swappable drives and available spares. A 'Stripe of Mirrors' (RAID 10) is probably best practice.
• Install redundancy in storage controllers.
• Place the databases of your server on a different RAID array from the transaction log. Locate TempDB on a high-performance RAID array, because SQL Server cannot function without it.
• Provide both Network card and router redundancy
• Ensure at least 'Warm Standby' fall back servers by using clustering, database mirroring, synchronization or log shipping.

Software Failure precautions

• Software failure can happen due to software changes, but also when data changes. Even date changes can cause failure. 'Code Rot' is the common term for software system failure when no recent software changes have been made.
• Use Change and source control (see change failure above)
• Before rolling out a production release, do strict 'limit' testing (testing under the extremes of data or throughput, and with hardware components randomly unplugged to assess whether software degradation is 'graceful' or not)
• Perform regular regression testing on the test server with different simulated loads.
• Avoid overlapping jobs in the SQL Server Agent; do routine DBCC checks and re-indexes of tables at off-peak times.
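
The advice about overlapping jobs can be checked on paper before a schedule goes live. The following is a minimal sketch, assuming each job's expected start and end time is already known; the job names, times, and the `JobWindow` and `find_overlaps` helpers are hypothetical and are not part of SQL Server.

```python
from datetime import datetime
from itertools import combinations
from typing import List, NamedTuple, Tuple


class JobWindow(NamedTuple):
    """Expected run window of one scheduled job (hypothetical structure)."""
    name: str
    start: datetime
    end: datetime


def find_overlaps(windows: List[JobWindow]) -> List[Tuple[str, str]]:
    """Return the pairs of job names whose run windows overlap."""
    overlaps = []
    for a, b in combinations(windows, 2):
        # Two half-open windows [start, end) overlap when each one
        # starts before the other one ends.
        if a.start < b.end and b.start < a.end:
            overlaps.append((a.name, b.name))
    return overlaps


# Hypothetical off-peak schedule, for illustration only.
schedule = [
    JobWindow("integrity_check", datetime(2009, 11, 12, 1, 0), datetime(2009, 11, 12, 2, 30)),
    JobWindow("index_rebuild", datetime(2009, 11, 12, 2, 0), datetime(2009, 11, 12, 3, 0)),
    JobWindow("full_backup", datetime(2009, 11, 12, 3, 0), datetime(2009, 11, 12, 4, 0)),
]

print(find_overlaps(schedule))  # [('integrity_check', 'index_rebuild')]
```

A job that starts exactly when another ends is not reported, because the windows are treated as half-open. Real run times vary, so it is probably wise to pad each window before checking.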

Network Failure precautions

TCP/IP is designed fundamentally as a resilient system in the event of disaster, but this relies on the network infrastructure being able to route network packets via alternative pathways in the event of the failure of a pathway.

• Secondary DNS/WINS servers must be provided.
• The system must not be reliant on a single domain server or active directory.
• There should be redundant routers/switches.
• Redundant WAN/Internet connections are generally important.
• Ensure that there is no single point of failure in the network by regular 'limit-testing'
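
Physical 'limit-testing' can be complemented by a paper check of the network diagram. The sketch below assumes the topology is written down as an adjacency list; it removes each device in turn and tests whether the client can still reach the database server. The device names and both helper functions are hypothetical, and the check does not replace real failure testing.

```python
from collections import deque
from typing import Dict, List


def reachable(graph: Dict[str, List[str]], source: str, target: str, removed: str) -> bool:
    """Breadth-first search from source to target, ignoring the removed node."""
    if removed in (source, target):
        return False
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for neighbour in graph.get(node, ()):
            if neighbour != removed and neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return False


def single_points_of_failure(graph: Dict[str, List[str]], source: str, target: str) -> List[str]:
    """List the intermediate devices whose loss cuts source off from target."""
    return sorted(
        node
        for node in graph
        if node not in (source, target)
        and not reachable(graph, source, target, removed=node)
    )


# Hypothetical topology: two switches, but only one router.
topology = {
    "client": ["switch_a", "switch_b"],
    "switch_a": ["client", "router"],
    "switch_b": ["client", "router"],
    "router": ["switch_a", "switch_b", "db_server"],
    "db_server": ["router"],
}

print(single_points_of_failure(topology, "client", "db_server"))  # ['router']
```

In this made-up example the switches are redundant but the router is not, which is the kind of gap the list above is meant to close.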

Security Failure precautions

• Ensure the physical security of each SQL Server.
• Create alerts and reports for any unusual patterns of user activity on the server, and investigate them (SQL Data Compare is very handy for this)
• Give users the fewest permissions they need to perform their job.
• Audit all login and logout events
• Use DDL triggers to log and notify all changes to the security configuration of the server.
• Adopt all current security best-practices when implementing the Server

Tuesday, November 3, 2009

Overview of Highlevel Availability

Why High Availability?

Mission critical computer systems need to be available 24 hours a day, 7 days a week, and 365 days a year.

When these systems and their applications are unavailable because of technical problems, we suffer lost productivity at work and inconvenience in our personal lives. It is not difficult to calculate the real costs incurred when critical business systems are down.
More serious consequences occur when critical systems such as traffic control, medical life support, or Health services systems are not functioning.

When an application becomes unavailable, the work that it was doing simply stops.
At best, such an outage simply results in lost productivity: the application will be up and running some time later, and the work will be completed then. More serious consequences can include safety incidents, legal action, fines, or simply negative publicity.
The impact of downtime will vary from business to business and within a business from application to application.

Availability, High Availability, and Fault Tolerance: What do these terms mean?

Availability is the percentage of time that a system operates during its intended duty cycle. For example, if a given system is expected to be functional for 8 hours per day, then availability is measured as a percentage of those eight hours. If a system is non-functional outside this period, it is not counted against the “Availability Metric.”
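
As a minimal sketch of this definition, the function below computes availability from the length of the duty cycle and the downtime that fell inside it. The function name and the example figures are illustrative assumptions.

```python
def availability_percent(duty_cycle_hours: float, downtime_hours: float) -> float:
    """Return availability as a percentage of the intended duty cycle.

    Only downtime that falls inside the duty cycle should be passed in;
    outages outside it do not count against the metric.
    """
    if duty_cycle_hours <= 0:
        raise ValueError("duty_cycle_hours must be positive")
    if not 0 <= downtime_hours <= duty_cycle_hours:
        raise ValueError("downtime_hours must lie within the duty cycle")
    uptime_hours = duty_cycle_hours - downtime_hours
    return 100.0 * uptime_hours / duty_cycle_hours


# Hypothetical example: an 8-hour duty cycle over 20 working days (160 hours)
# with 2 hours of outage during those hours.
print(availability_percent(160, 2))  # 98.75
```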

High Availability attempts to specify, as a percentage of the intended duty cycle, the amount of time that a system must be functional. For example, if we specify the availability metric as “Five Nines,” it is understood to mean that the system should be functional for 99.999% of the desired duty cycle.
Refer to the following table for examples of various levels of availability and the associated allowable downtime per year, month, and week, assuming a 24-hour-per-day duty cycle.

Availability %           | Downtime / year | Downtime / month (30 days) | Downtime / week
99.9% ("three nines")    | 8.76 hours      | 43.2 minutes               | 10.1 minutes
99.99% ("four nines")    | 52.6 minutes    | 4.32 minutes               | 1.01 minutes
99.999% ("five nines")   | 5.26 minutes    | 25.9 seconds               | 6.05 seconds
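
The figures in the table follow from multiplying the length of the period by the fraction of time the system may be down. A short sketch of that arithmetic, assuming a 365-day year and a 30-day month as the table does (the constant and function names are illustrative):

```python
HOURS_PER_YEAR = 365 * 24   # 8760 hours, ignoring leap years (assumption)
HOURS_PER_MONTH = 30 * 24   # 30-day month, as in the table
HOURS_PER_WEEK = 7 * 24


def allowed_downtime_hours(availability_percent: float, period_hours: float) -> float:
    """Hours of downtime permitted in a period at a given availability level."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return period_hours * (1 - availability_percent / 100)


for pct in (99.9, 99.99, 99.999):
    per_year = allowed_downtime_hours(pct, HOURS_PER_YEAR)
    per_month = allowed_downtime_hours(pct, HOURS_PER_MONTH)
    per_week = allowed_downtime_hours(pct, HOURS_PER_WEEK)
    # Convert to convenient units: hours, minutes, and seconds.
    print(f"{pct}%: {per_year:.2f} h/year, "
          f"{per_month * 60:.2f} min/month, {per_week * 3600:.1f} s/week")
```

The same function works for a shorter duty cycle: pass the number of duty-cycle hours in the period instead of the full 24-hour figures.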

There are two main hardware concerns with respect to maintaining a highly available database environment: server high availability and storage availability.

Fault tolerance (data redundancy in the database) differs from high availability by providing additional resources that allow an application to continue functioning, without interruption, after a component failure. Many of the high-availability solutions on the market today actually provide fault tolerance for a particular application component. Disk mirroring, where there are two disk drives with identical copies of the data, is an example of a fault-tolerant component. If one of the disk drives fails, there is another copy of the data that is instantly available, so the application can continue execution.

High Availability (HA) solution

A High Availability (HA) solution must address both unplanned and planned causes of downtime to achieve a truly fault-tolerant and resilient IT infrastructure.

Causes of Planned down time

Repair and upgrades that have minimal impact on the business are considered maintenance. For many applications, availability during business hours is required, but some downtime during non-business hours is acceptable.

All systems will require maintenance at some point. If management does not plan for system maintenance, the system will pick the time and duration for an outage! It is up to the system designer to understand the business need and design the system to allow for planned downtime, thereby minimizing the risk of a system failure.

Causes of Unplanned down time

 * Hardware failure
   --> Server hardware
   --> Storage hardware
 * Human error
   --> System management tools
   --> Staff training
   --> Process-oriented IT organization
 * Software corruption or bug
 * Viruses
 * Natural disaster