Thursday, August 19, 2010

Basic Linux Commands


Each entry below gives the command, a short description, and example usage:
cat
Sends file contents to standard output. This is a way to list the contents of short files to the screen. It works well with piping.

cat .bashrc - Sends the contents of the ".bashrc" file to the screen.
cd
Change directory

cd /home - Change the current working directory to /home. The leading '/' makes the path absolute, so no matter what directory you are in when you execute this command, the working directory becomes "/home".

cd httpd - Change the current working directory to httpd, relative to the current location, which is "/home". The full path of the new working directory is "/home/httpd".

cd .. - Move to the parent directory of the current directory. This command will make the current working directory "/home".

cd ~ - Move to the user's home directory, which is "/home/username". The '~' indicates the user's home directory.
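The cd examples above can be tried directly in a shell. A short sketch, using /tmp rather than /home since /tmp exists on every system (the "demo" directory is just an illustration):

```shell
cd /tmp            # absolute path: the leading '/' means "from the root"
pwd                # prints /tmp

mkdir -p demo
cd demo            # relative path: resolved against the current directory
pwd                # prints /tmp/demo

cd ..              # '..' is the parent directory
pwd                # prints /tmp

cd ~               # '~' is the current user's home directory
```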
cp
Copy files

cp myfile yourfile - Copy the file "myfile" to the file "yourfile" in the current working directory. This command will create "yourfile" if it doesn't exist, and will normally overwrite it without warning if it does.

cp -i myfile yourfile - With the "-i" option, if the file "yourfile" exists, you will be prompted before it is overwritten.

cp -i /data/myfile . - Copy the file "/data/myfile" to the current working directory and name it "myfile". Prompt before overwriting the file.

cp -dpr srcdir destdir - Copy all files from the directory "srcdir" to the directory "destdir", preserving links (-d option) and file attributes (-p option), and copying recursively (-r option). With these options, a directory and all its contents can be copied to another directory.
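A quick sketch of the cp variants above, run in a throwaway directory so nothing real gets overwritten (the filenames match the examples; -d is a GNU cp option):

```shell
cd "$(mktemp -d)"            # scratch directory so nothing real is touched
echo "hello" > myfile

cp myfile yourfile           # creates yourfile; would silently overwrite an existing one
cp -i myfile otherfile       # -i would prompt before overwriting; otherfile is new, so no prompt

mkdir srcdir
cp myfile srcdir/
cp -dpr srcdir destdir       # -d keeps links, -p keeps attributes, -r recurses (GNU cp)
ls destdir                   # prints: myfile
```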
dd
Disk duplicate. The man page says this command is to "Convert and copy a file", but although used by more advanced users, it can be a very handy command. The "if" means input file, "of" means output file.

dd if=/dev/hdb1 of=/backup/ - Reads the "/dev/hdb1" partition as the input file and writes it to "/backup/" as the output.
df
Show the amount of disk space used on each mounted filesystem.
less
Similar to the more command, but the user can page up and down through the file.

less textfile - Displays the contents of "textfile".
ln
Creates a symbolic link to a file.

ln -s test symlink - Creates a symbolic link named "symlink" that points to the file "test". Typing "ls -i test symlink" will show that the two files are different, with different inodes. Typing "ls -l test symlink" will show that symlink points to the file test.
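The symbolic-link example above, as a runnable sketch in a scratch directory (the names "test" and "symlink" are the ones used in the text):

```shell
cd "$(mktemp -d)"
echo "data" > test
ln -s test symlink

ls -i test symlink   # two different inode numbers: distinct directory entries
ls -l symlink        # shows: symlink -> test
cat symlink          # follows the link and prints the contents of test
```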
locate
A fast database driven file locator.

slocate -u - This command builds the slocate database and will take several minutes to complete. It must be run before searching for files; however, cron runs it periodically on most systems.

locate whereis - Lists all files whose names contain the string "whereis".
logout
Logs the current user off the system.
ls
List files

ls - List files in the current working directory except those starting with '.', showing only the file names.

ls -al - List all files in the current working directory in long listing format, showing permissions, ownership, size, and time and date stamp.
more
Allows file contents or piped output to be sent to the screen one page at a time.

more /etc/profile - Lists the contents of the "/etc/profile" file to the screen one page at a time.

ls -al | more - Performs a directory listing of all files and pipes the output of the listing through more. If the directory listing is longer than a page, it will be listed one page at a time.
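Piping through more, as a sketch. Interactively, space advances a page and 'q' quits; when more's output is not a terminal (as in a script), it simply passes the text through:

```shell
# Page through a long directory listing interactively:
ls -al /etc | more

# In a non-interactive pipeline, more behaves like cat:
printf 'one\ntwo\nthree\n' | more
```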
mv
Move or rename files

mv -i myfile yourfile - Move the file "myfile" to "yourfile". This effectively changes the name of "myfile" to "yourfile". With the "-i" option, you will be prompted before an existing "yourfile" is overwritten.

mv -i /data/myfile . - Move the file "myfile" from the directory "/data" to the current working directory.
pwd
Show the name of the current working directory

pwd - Displays the full path of the current working directory.
shutdown
Shuts the system down.

shutdown -h now - Shuts the system down and halts immediately.

shutdown -r now - Shuts the system down immediately and reboots.
whereis
Show where the binary, source and manual page files are for a command

whereis ls - Locates the binaries and manual pages for the ls command.

Editors: emacs, vi, pico, jed, vim

Friday, May 28, 2010

MSCS Basics





Microsoft Cluster Server
Microsoft Cluster Server ("MSCS") is clustering software that first shipped with Microsoft Windows NT Server - Enterprise Edition. MSCS 1.0 (codenamed "Wolfpack") was released in 1997. Since then, MSCS has been upgraded to version 1.1 in Windows 2000 Advanced Server and Datacenter Server and to version 1.2 in Windows Server 2003 Enterprise Edition and Datacenter Edition.

MSCS supports cluster nodes, which are specially linked servers running the cluster service. The primary function of MSCS comes into play when one server in a cluster fails or is taken offline: the other server in the cluster takes over the failed server's operations. Clients using server resources experience little or no interruption of their work because the resource functions move from one server to the other. The primary purpose of clustering is to provide failover and reinstantiation of services and resources, thereby providing increased availability for those services (e.g., messaging, database, file and print, etc.).

MSCS consists of two main components: the clustering software and the cluster administration tools (cluadmin.exe, a GUI, and cluster.exe, a command-line management tool). The clustering software enables the two servers of a cluster to exchange specific types of messages that trigger the transfer of resources at the appropriate times. It has two primary components: the Cluster Service and the Resource Monitor. The Cluster Service runs on each cluster server. It controls cluster activity, communication between cluster servers, and failure operations. The Resource Monitor handles communication between the Cluster Service and the application resources. The Cluster Administrator is a graphical application that is used to manage a cluster. It runs on any version of NT (Server or Workstation) with Service Pack 3 or later installed, as well as on Windows 2000, Windows XP, and Windows 2003.
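As a sketch of the command-line side, cluster.exe exposes the same objects the GUI shows. The node and group names below are hypothetical, and exact switches vary by Windows version:

```
REM List the clusters in the domain, then query node and group state
cluster /list
cluster node NODE1 /status
cluster group "Cluster Group" /status

REM Manually fail the group over to the second node
cluster group "Cluster Group" /moveto:NODE2
```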

In MSCS, a cluster is a configuration of two nodes, each of which is an independent computer system. Together, these independent servers create a "server cluster." The cluster appears to users as a single server. For MSCS, both nodes must be running NT Server - Enterprise Edition, Windows 2000 Advanced Server or Datacenter Server, or Windows Server 2003 Enterprise Edition or Datacenter Edition. The network applications, data files, and other tools you install on the nodes are the cluster resources, which provide services to network clients. A resource is hosted on only one node at any time. The figure below shows the relationship between nodes, groups, and resources.


Picture Source: Microsoft Cluster Server Administrator's Guide



Windows Cluster Terminology

Clustering introduces several new terms which should be thoroughly understood before clusters of any kind are implemented.

* Node. The term used to refer to a server that is a member of a cluster.
* Resource. A hardware or software component that exists in a cluster, such as a disk, an IP address, a network name, or an instance of an Exchange 2000 component.
* Group. A combination of resources that are managed as a unit of failover. Groups are also known as resource groups or failover groups.
* Dependency. An alliance between two or more resources in the cluster architecture. You'll need to understand cluster resource dependencies when installing a cluster.
* Failover/failback. The process of moving resources from one server to another. Failover can occur when one server experiences a failure of some sort or when you, the administrator, initiate a proactive failover.
* Quorum resource. This is a special type of cluster resource that provides a persistent arbitration mechanism by allowing one node to gain control of it and then defending that node's control. In addition, it provides physical storage that can be accessed by any node in the cluster (though only one node can access the quorum at any given time). The quorum also maintains access to the most current version of the cluster database, and if a failure occurs, the quorum writes the changes to the cluster database.
* Heartbeat. The network and Remote Procedure Call (RPC) traffic that flows between servers in a cluster. Windows 2000 and Windows 2003 clusters communicate by using RPC calls on IP sockets with User Datagram Protocol (UDP) packets. Heartbeats are single UDP packets sent between nodes every 1.2 seconds. These packets are used to confirm that each node's network interface is still active.
* Membership. This term is used to describe the orderly addition and removal of active nodes to and from the cluster.
* Global update. This term refers to the propagation of cluster configuration changes to all members. The cluster registry is maintained through this mechanism.
* Cluster registry. Inside the Windows 2000 registry is the cluster registry, also known as the cluster database. This maintains configuration information on each member in the cluster, as well as on resources and parameters. This information is stored on the quorum resource.
* Virtual server. A virtual server is a combination of configuration information and cluster resources, such as an IP address, a network name, and an application resource.
* Active/Active. From a software perspective, this describes applications (or resources) that can exist as multiple instances in a cluster. This means that both nodes can be active servicing clients.
* Active/Passive. This term describes applications that run as a single instance in a cluster. This generally means that one node typically sits idle until a failover occurs. However, you can have an Active/Passive implementation of an application in an Active/Active cluster. An example would be a cluster that contains clustered file and print sharing resources and a single Exchange or SQL virtual server.
* Shared storage. This refers to the external SCSI or Fibre Channel storage enclosure and the disks contained therein. Shared storage is a requirement for multi-node clusters. Although this storage is shared, only one node can access an external storage resource at any given time.



Windows 2000 Cluster Services
Windows NT 4.0 Enterprise Edition included Microsoft Cluster Server 1.0. Windows 2000 Advanced Server and Windows 2000 Datacenter Server include the Windows Cluster Service, or Microsoft Cluster Server 1.1. Aside from the 4-node capabilities of Windows 2000 Datacenter Server, there aren't a whole lot of differences between the two versions. Windows 2000 clusters do add the following enhancements, though:

* Event Log Replication
* Support for more services (e.g., DFS, IIS, etc.)
* Improved client network recovery
* Support for rolling upgrades

Clusters created using Microsoft Cluster Server or Windows 2000 Cluster Services are known as Server Clusters.

Windows Load Balancing Service ("WLBS")
WLBS is a load balancing feature for Windows NT TCP/IP applications that supports load balancing and clustering for web-based services such as Internet Information Server (web, FTP, etc.), streaming media, virtual private networking (VPN), and Microsoft Proxy Server. WLBS was formerly Convoy Cluster Software, a product of Valence Research, Inc. of Beaverton, Oregon; Microsoft acquired Valence Research on August 25, 1998. Both microsoft.com and msn.com used WLBS (now NLB, see below) to manage the high volume of traffic those sites receive. WLBS is a free download for Windows NT 4.0 Enterprise Edition users.

Network Load Balancing ("NLB")
Formerly known as Windows Load Balancing, Network Load Balancing is the name of the TCP/IP application load balancing software in Windows 2000 Advanced Server and Datacenter Server, and in all editions of Windows 2003. NLB clusters distribute client connections over multiple servers, providing scalability and high availability for TCP/IP-based services and applications.

Component Load Balancing ("CLB")
CLB is also known as Application Load Balancing, and sometimes as COM+ Load Balancing. It is a way for developers to code COM+ components for use by multiple application servers. CLB provides scalable, reliable, and load-balanced activation of COM+ objects across application-cluster members in a manner transparent to clients. This enables virtualization of an application in much the same way that NLB and Server Clusters virtualize servers. A primary CLB server watches for application server failures and automatically moves the affected objects to another cluster member.

How Many Servers can I cluster?
MSCS and Windows 2000 Advanced Server support a maximum of two servers (nodes) per server cluster. Both servers must be running either the Enterprise Edition of NT Server 4.0 or Windows 2000 Advanced Server. Windows 2000 Datacenter Server supports a maximum of four nodes per cluster.

How many servers can I load balance?
WLBS and NLB support a maximum of 32 nodes.

Windows 2003 Cluster Services
There are a substantial number of improvements in Windows 2003 Server Clusters and NLB:

* Windows Server 2003 Enterprise Edition, the upgrade to Windows 2000 Advanced Server, will support up to eight nodes per cluster.
* Windows Server 2003 Datacenter Server will also support up to eight nodes per cluster.
* All versions of Windows Server 2003 will include Network Load Balancing.
* A number of Server Cluster enhancements, such as 64-bit support, greatly improved setup, and Active Directory integration.
* A number of Network Load Balancing enhancements, such as better manageability, multi-NIC support, and bi-directional affinity support for clustered ISA Servers.
* Windows Server 2003 introduces the concept of a Majority Node Set. This allows server clusters to be built without using the shared disk for the quorum, which enables you to build and configure geographically dispersed clusters.
* Active Directory integration
* Support for Dynamic Disks and the Encrypting File System
* Enhanced setup, resource configuration, and management

Windows 2003 NLB Services
There are also a number of NLB enhancements in Windows 2003, including:

* IGMP support (to limit switch flooding)
* Bi-directional affinity, which enables load balancing of ISA Server
* Support for multiple network interface cards
* Virtual clusters for traffic filtering, host preference, and separate configurations
* New NLB Manager utility

What Microsoft BackOffice applications can I cluster?
You can cluster several BackOffice applications: Exchange Server 5.5 Enterprise Edition, Exchange 2000 Enterprise Server, SQL Server 6.5/7.0/2000 Enterprise Edition, Internet Security and Acceleration Server 2000, and Internet Information Server 4.0 and 5.0. You can also cluster file and print services. You cannot cluster Microsoft SNA Server (or Host Integration Server), Microsoft SMS, Microsoft Proxy Server, or RAS. Using NLB, you can load-balance Microsoft Proxy Server and ISA Server, Outlook Web Access, intranets, and other IP-based applications. Using Windows 2000 Advanced Server, Windows Server 2003 Enterprise Edition, or Windows Server 2003 Datacenter Edition, you can also cluster WINS, DHCP, and DFS.

Requirements of Server Clusters (MSCS and Cluster Service)
There are certain things you need to have before you can cluster servers using MSCS or Windows 2000 Cluster Services. For example, you need:

* Two servers (hardware from the cluster category of Microsoft's HCL, with identical configurations, e.g., RAM, CPUs, etc., is recommended)
* PCI cluster interconnect (two are recommended for redundancy)
* Two NICs per node (one for the private network; one for the public)
* TCP/IP
* External storage cabinet
* SCSI or Fibre Channel for the shared disks

Best Practices
There are some general things you can do to ensure your cluster configurations are as robust, reliable, available, and scalable as possible.

1. Perform a risk audit. This involves analyzing the various components of your cluster to determine what single points of failure remain. For instance, do you need additional protection for network connectivity (e.g., a redundant path), power-loss protection (e.g., a UPS), and the like?

2. Make sure your application can be clustered. There are two main application types within the context of clustering:
* Cluster-aware applications: these applications can use cluster services via the Cluster API. They provide a DLL that implements the LooksAlive and IsAlive entry points used to monitor and manage their resources within a cluster. These applications are designed to be clustered.
* Cluster-unaware applications: these applications don't use the cluster APIs, don't have special DLL files, and are not managed as cluster resources. However, that doesn't mean they cannot be installed on a cluster, or that they cannot fail over properly. Many applications that are not cluster-aware can still function normally in a clustered environment. Be sure to test your application in a test cluster before going into production.

3. Make sure your hardware is on Microsoft's Hardware Compatibility List for Clustering. If your hardware is not on the Cluster HCL, don't use it. It may work, but it is not supported by Microsoft and your results will be unpredictable. Familiarize yourself with Microsoft's Support Policy for Server Clusters and the HCL.

4. Don't skimp on other areas of redundancy. Where possible, take advantage of other technologies with redundancy. Use RAID arrays on internal drives and the external shared disk. Use redundant network adapters, networking equipment, power supplies, fans, and CPUs.

5. Test your cluster in a lab environment. Test EVERYTHING! This includes UPS software/hardware, backup software, simulated failures, manual failovers, and the effect of failures that clustering doesn't protect you from (e.g., router failures, etc.). Even go so far as to yank network cables, pull drives that are part of RAID arrays, and power off one of the nodes. The more testing you do, the better prepared you'll be to manage and support your cluster(s).

6. Research, Research, Research. Read the white papers and other documents published by Microsoft. Check out the Microsoft public newsgroups. Talk with other cluster operators, application vendors, and hardware manufacturers. You simply cannot digest enough information on clustering. After all, you want your cluster to be as reliable and available as possible.

Clustering Basics




Clusters Defined
A cluster is a group of independent computers working together as a single system to ensure that mission-critical applications and resources are as highly-available as possible. The group is managed as a single system, shares a common namespace, and is specifically designed to tolerate component failures, and to support the addition or removal of components in a way that's transparent to users. Clustered systems have several advantages: fault-tolerance, high-availability, scalability, simplified management and support for rolling upgrades, to name a few.

There are two different types of cluster models in the industry: the shared device model and the shared nothing model.

In the shared device model, applications running within a cluster can access any hardware resource connected to any node in the cluster. As a result, access to the data must be synchronized. In many such implementations, a special component called a Distributed Lock Manager (DLM) is used for this purpose. A DLM is a service that manages access to cluster hardware resources. When multiple applications access the same resource, the DLM resolves any conflicts that might arise. Along with this sophistication and complexity, a DLM adds significant overhead to the cluster. Most of this is additional traffic between nodes; a performance hit is also incurred because access to hardware resources must be serialized.

By default, Microsoft Cluster Server and the Windows Cluster Service use the shared nothing model. Because this model does not use a DLM, it does not have the overhead incurred by using such a service. In the shared nothing model, only one node can own and access a single hardware resource at any given time. When failure occurs, a surviving node can take ownership of the failed node's resources and make them available to users.

While both Microsoft Cluster Server and the Windows Cluster Service support the shared nothing model, they can use the shared device model, but only if the clustered application supplies its own DLM.

Why Cluster?
Generally speaking, hardware failure is not the predominant cause of downtime. The leading causes of downtime are typically related to events that are external to the system, such as misconfiguration, power outages, security breaches, and so forth. Clustering cannot help you solve those types of problems. In addition, a cluster cannot protect you from software incompatibilities, corrupt databases, viruses, catastrophes or mistakes. Clustering is best implemented when a substantial proportion of your server downtime is caused by hardware failure. If your organization’s leading cause of downtime is the result of failures in administration, software, or infrastructure, an investment in clustering technology may not reduce your downtime.

You first need to assess the reasons for server downtime in your organization, look at the problems that clustering solves, and then make a business decision as to whether clustering is an appropriate solution. The primary focus of clustering is solving problems that arise from hardware failure, such as a blown CPU, bad memory, or the loss of an entire server. In addition, clustering allows you to continue providing resources during planned outages that may cause downtime for users. A cluster system can allow resources to be manually moved—or failed over—to one server while the other is brought down to perform a rolling upgrade, a configuration change, or other maintenance.

A rolling upgrade is the process of applying a service pack or other hardware or software update to each node in the cluster while the other node continues providing service. Rolling upgrades are typically a series of stages:

* Move the groups from the node to be upgraded to another node.
* Take the node to be upgraded offline.
* Perform the installation or upgrade on the offline node.
* Bring the upgraded node back online.
* Move the groups back to the upgraded node.

Then, repeat this process on each node in the cluster until the entire cluster is upgraded. Rolling upgrades are very attractive from a server management standpoint because services are only unavailable during the time it takes to move resources from one node to the other.

By design, clusters help increase uptime. Increased uptime really means reduced downtime. Clustering can help reduce both planned and unplanned downtime. When any mission critical system fails, the consequences can include lost revenue, interruption of services to customers, and knowledge workers unproductively sitting idle. In organizations of all sizes, failures incur costs in many areas. Hidden costs often include damage to your reputation among customers, suppliers, and end-users, and the perception that your organization isn't able to satisfy customer needs.

Understanding the limitations of clustering is just as important as understanding the benefits. While clustering protects against the failure of a node in the cluster, it does not provide any protection against other problems, such as network failures, database corruption, loss of shared storage, or disasters.
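The rolling-upgrade stages above map roughly onto cluster.exe as a sketch; the node and group names are hypothetical, and exact switches vary by version:

```
REM 1. Move the groups off the node to be upgraded
cluster group "File Share Group" /moveto:NODE2

REM 2. Pause the node so resources cannot fail back during maintenance
cluster node NODE1 /pause

REM 3. Apply the service pack or upgrade on NODE1 (reboot if required)

REM 4. Return the node to service
cluster node NODE1 /resume

REM 5. Move the groups back
cluster group "File Share Group" /moveto:NODE1
```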

Before implementing a cluster in your environment, you should evaluate whether this solution really solves enough of your problems to justify its cost. Clustering adds complexity to your environment and administration. Therefore, it is important that you understand and evaluate this technology in relation to your overall goals and the needs of your network.

Fault Tolerance Defined
Fault tolerance is the ability of a system to continue functioning when part of it fails (e.g., experiences a fault). This term is used to describe disk subsystems (e.g., RAID), symmetric multiple processors (SMP), redundant power supplies (with separate power sources), uninterruptible power supplies, redundant network adapters, etc. Fault tolerance is designed to alleviate the problems caused by component failures, power outages, or other like occurrences.

Disk subsystems that use RAID, which stands for Redundant Array of Inexpensive Disks (or Redundant Array of Independent Disks, or Redundant Array of Inexpensive Devices, depending on who you ask) are considered fault tolerant. RAID refers to the grouping of individual hard disks in a way that provides continued operation in the event of a disk failure. There is both hardware RAID (e.g., a RAID controller is used) and software RAID (e.g., the functionality is provided by an operating system or application). There are many forms (levels) of RAID:

* RAID-0: Stripe set without parity. Stripe sets work well with databases due to the usually random I/O nature of database transactions. In RAID-0, data is divided into blocks and spread (in a fixed order) across all of the disks in an array. RAID-0 improves read/write performance by spreading operations across multiple disks, so that operations can be performed independently and simultaneously. While RAID-0 provides the highest performance, it does not provide any fault tolerance. If a drive in a RAID-0 array fails, all of the data within the stripe set becomes inaccessible.
* RAID-1: Mirroring. Disk mirroring provides a redundant, identical copy of a disk. Data written to the primary disk is also written to a mirror disk. RAID-1 provides fault tolerance and generally improves read performance, but it may also degrade write performance. Because dual-write operations can degrade system performance, many mirror set implementations use duplexing, where each mirror drive has its own disk controller. While the mirror approach provides good fault tolerance, it is relatively expensive to implement. In addition, only half of the available disk space can be used for storage. The other half is needed for mirroring.
* RAID-5: Stripe set with parity. RAID-5 provides redundancy of all data on the array, allowing a single disk to fail and be replaced, in most cases, without system downtime. RAID-5 offers lower performance than RAID-0 or RAID-1 but higher reliability and faster recovery. RAID-5 uses the equivalent of one disk for storing the parity strips, but distributes the parity strips across all the drives in the array. The data and parity information are arranged on the disk array so that they are always on different disks.

There are other implementations of RAID, such as RAID-0+1 (aka RAID-10), RAID-2, RAID-3, etc., but these are typically proprietary implementations unique to the hardware manufacturer that supports them.
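The capacity trade-off among the three common levels can be sanity-checked with simple arithmetic. A minimal sketch, assuming a hypothetical array of four 500 GB disks of equal size:

```shell
# Usable capacity of an n-disk array of equal-size disks:
#   RAID-0: n * size        (all striped, no redundancy)
#   RAID-1: (n/2) * size    (every disk is mirrored)
#   RAID-5: (n - 1) * size  (one disk's worth of parity, distributed)
awk 'BEGIN {
  n = 4; size = 500   # four 500 GB disks (hypothetical)
  printf "RAID-0: %d GB usable\n", n * size
  printf "RAID-1: %d GB usable\n", (n / 2) * size
  printf "RAID-5: %d GB usable\n", (n - 1) * size
}'
```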

High-Availability Defined
By definition, the goal of a highly available system is to provide continuous use of critical data and applications that keep businesses up and running, regardless of planned or unplanned interruption. High-availability refers to a system uptime that approaches 100%. For example, an availability level of 99.999%, calculated on a round-the-clock basis, means an organization experiences no more than about five minutes of unscheduled downtime per year. A level of 99.99% translates to 52 minutes of downtime, a level of 99.9% to 8.7 hours, and a level of 99% to about 3.7 days of downtime per year.
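These downtime figures follow directly from the formula downtime = minutes-per-year x (1 - availability). A quick sketch that reproduces them:

```shell
# Downtime per year implied by an availability level:
#   downtime = minutes_per_year * (1 - availability)
# 525600 = 365 * 24 * 60 minutes in a non-leap year.
for a in 99.999 99.99 99.9 99; do
  awk -v a="$a" 'BEGIN {
    printf "%s%% available -> %.1f minutes of downtime/year\n", a, 525600 * (1 - a / 100)
  }'
done
```

This prints roughly 5.3, 52.6, 525.6 (about 8.7 hours), and 5256 minutes (about 3.7 days), matching the figures above.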

The need for high availability is not limited to 365x24x7 environments. Many applications must be available during normal business hours or for critical time periods throughout the day. A system failure during these critical periods is unacceptable for many organizations.

Alternatives to Microsoft Cluster Server and Windows 2000 Cluster Services
Alternatives include Vinca's Co-StandbyServer, Vinca's Octopus, and Network Specialists' Double-Take. In addition, there are the shared storage clustering solutions provided by Digital's Clusters for Windows NT, NCR's LifeKeeper or Veritas FirstWatch. Finally, there is the fault tolerance and site disaster tolerance of Marathon's Endurance 4000.