Modelling a software problem

A solution to any software problem must have two characteristics: it must be correct and it must be efficient. One of the most important steps in solving a problem is to model it. A good software model of the problem determines how the problem gets solved and how efficient the solution is. Modelling well requires a good knowledge of software techniques and tools.

An engineer with good knowledge of data structures and algorithmic techniques should be able to model a single problem in multiple different ways, see the trade-offs between the different models, and pick a model with which the problem can be solved. Once the model has been decided, a variety of tools can be used to implement the solution.

Here is an example. Facebook lists a variety of problems on their puzzles page, ordered by how hard each puzzle is to solve. Let us look at one of the hardest puzzles, "FaceBull". The problem is to find the cheapest way to produce all the chemical compounds needed for the new super-energy drink from any single source compound. How would you model this problem?

Let each compound be a vertex of a graph. There is an edge from one compound to another when there is a machine that converts the first compound into the second. The weight of this edge is the cost of acquiring the machine that does the conversion. Let us draw this graph with the example input given in the problem.
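The same model can also be written down as plain data. Here is a minimal shell sketch, assuming a made-up list of machines: the machine names, compounds, and costs below are hypothetical and are not the puzzle's example input. Each line becomes one directed, weighted edge, and the compounds become the vertices.

```shell
#!/bin/sh
# Hypothetical machine list: name, source compound, target compound, cost.
# These values are invented for illustration only.
machines='M1 C1 C2 10
M2 C2 C3 20
M3 C3 C1 5
M4 C1 C3 40'

# Print each directed, weighted edge as "source -> target (cost)".
list_edges() {
    printf '%s\n' "$machines" | awk '{ printf "%s -> %s (%d)\n", $2, $3, $4 }'
}

# Count the distinct vertices (compounds) that appear on either end of an edge.
count_vertices() {
    printf '%s\n' "$machines" | awk '{ seen[$2] = 1; seen[$3] = 1 }
        END { n = 0; for (v in seen) n++; print n }'
}

list_edges
echo "vertices: $(count_vertices)"
```

Keeping the machine name in the first column means a chosen set of edges can later be reported as a set of machines to buy.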

Now in this model, the problem becomes: starting from any vertex, find a minimum-cost path that visits every vertex of the graph at least once. This is a variation of the classic traveling salesman problem (TSP). The decision version of TSP is NP-complete. No wonder Facebook marked this as a hard problem.

We have identified a well-known and well-studied problem. The brute-force approach, which tries every ordering of the vertices, takes O(n!) time, where n is the number of vertices. Such a program would take years to finish for just 20 vertices. Although it is difficult to solve TSP and find the optimal solution for all inputs, there are many heuristics and approximation algorithms that give a solution close to the optimal one.
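To put a number on "years", here is a small arithmetic sketch in shell. It assumes a machine that can examine one billion orderings per second, which is a figure picked purely for illustration, and it relies on the shell using 64-bit integer arithmetic, which is just enough to hold 20!.

```shell
#!/bin/sh
# factorial N: prints N! using integer shell arithmetic
# (exact up to N = 20 with 64-bit integers).
factorial() {
    n=$1
    result=1
    i=2
    while [ "$i" -le "$n" ]; do
        result=$((result * i))
        i=$((i + 1))
    done
    echo "$result"
}

# years_to_enumerate N RATE: whole years needed to examine N! orderings
# at RATE orderings per second (RATE is an assumed figure).
years_to_enumerate() {
    orderings=$(factorial "$1")
    echo $((orderings / $2 / 31536000))   # 31536000 seconds in a 365-day year
}

echo "20! = $(factorial 20)"
echo "years at 10^9 orderings/s: $(years_to_enumerate 20 1000000000)"
```

Under that assumption the enumeration for 20 vertices takes roughly 77 years, and each additional vertex multiplies the work again.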

Notice how we transformed a chemical industry application problem into a well studied computer science problem with the right modelling.

When Nick and Thorsten were visiting the Bay Area for the Open HA Cluster Summit, we talked about some of the difficult problems in Solaris Cluster and how we may model them. This is one of the exciting aspects of working at Sun and on Solaris Cluster. There are many difficult problems to solve and people are eager to solve them. I love the opportunity that every difficult problem presents. Stay tuned, we are just getting started.



Shared Nothing Storage in Open HA Cluster

Two years back I led and designed a project to make Solaris Cluster easy to use. The wizards that resulted from this effort are now a key element of the Solaris Cluster user experience. Many new projects want their features supported through the wizards today. The architecture of the project even made it to the IEEE Cluster conference.

This past year I led another important effort for Solaris Cluster. The resulting feature, Shared Nothing Storage, was released as part of Open HA Cluster 2009.06 and removed a major hardware requirement for the cluster: the need for shared storage. This was achieved by configuring the iSCSI protocol stack present in COMSTAR in a particular fashion and layering a ZFS mirror on top. The feature allows a user to use any local disk present in the system as storage for a service and to make that service highly available.

There is no need to turn disk fencing on for this configuration and therefore it also removes the need to have SCSI reservations. Here is a picture of the configuration, with detailed configuration instructions here.

The key challenge in providing this feature was to make the cluster device subsystem robust enough to handle devices that are attached via the network. The design details are present here.

This configuration becomes more interesting when I/O multipathing is configured, because it shows the flexibility and the power of the COMSTAR architecture. With COMSTAR, a single logical unit of storage can be accessed via multiple port providers, multiple iSCSI targets in this case. These multiple iSCSI targets can be used to create multiple paths to the same logical unit. This provides fast mirroring of data in the cluster configuration. If you want to understand the different configurations with multipathing, Aaron Dailey and Scott Tracy have an excellent white paper on using MPxIO on Solaris. Here is a picture of the cluster configuration with I/O multipathing.

 Try it out. Join the discussions at

Synchronization of common agent container security files

Solaris Cluster uses the common agent container as part of its management infrastructure. The common agent container (CAC) uses public key mechanisms for encryption and authentication. Here is the complete guide that explains CAC in a lot more detail.

In Solaris Cluster, the CAC keys must be the same on all the nodes of the cluster, so that the management infrastructure can communicate with all the cluster nodes. The cluster software ensures that these keys are the same on all the cluster nodes. However, there can be scenarios in which these keys go out of sync. When that happens, you will start seeing errors like the one below:

             ERROR: Unable to connect to the common agent container on node
             pneta1. Ensure that the common agent container is running and you
             have the required authorizations to connect to the common agent
             container on this node.

    Press RETURN to continue


 Here are the steps to correct this situation.

1. Stop CAC on all the cluster nodes

   /usr/sbin/cacaoadm stop


2. Copy the CAC security files from one node of the cluster to all the other nodes of the cluster.

    On any one node, run:

   cd /etc/cacao/instances/default/

   tar cf /tmp/SECURITY.tar security

   Then transfer SECURITY.tar to /tmp on all the other nodes and, on each of them, run:

   cd /etc/cacao/instances/default/

   tar xf /tmp/SECURITY.tar

   You can now remove all the copies of SECURITY.tar.


3. Restart the CAC on all the cluster nodes

    /usr/sbin/cacaoadm start
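
The three steps above can be sketched as one script. This is only a sketch under stated assumptions: the node names are placeholders, it assumes root can use ssh and scp between the nodes from wherever the script runs, and by default it runs in dry-run mode, which prints the commands instead of executing them. The variable names DRY_RUN, SOURCE_NODE, and OTHER_NODES are hypothetical and exist only in this example.

```shell
#!/bin/sh
# Sketch: print (or run) the commands for the CAC key resync procedure.
# Assumptions: node names are placeholders, and root can use ssh/scp
# between nodes. DRY_RUN=1 (the default) only prints the commands.
DRY_RUN=${DRY_RUN:-1}
SOURCE_NODE=${SOURCE_NODE:-node1}          # hypothetical: node whose keys are kept
OTHER_NODES=${OTHER_NODES:-"node2 node3"}  # hypothetical: nodes that receive the keys
CAC_DIR=/etc/cacao/instances/default

# run CMD...: echo the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi
}

sync_cac_keys() {
    # 1. Stop CAC on every node.
    for node in $SOURCE_NODE $OTHER_NODES; do
        run ssh "$node" /usr/sbin/cacaoadm stop
    done
    # 2. Archive the security directory on the source node, then distribute it.
    run ssh "$SOURCE_NODE" "cd $CAC_DIR && tar cf /tmp/SECURITY.tar security"
    for node in $OTHER_NODES; do
        run scp "$SOURCE_NODE:/tmp/SECURITY.tar" "$node:/tmp/SECURITY.tar"
        run ssh "$node" "cd $CAC_DIR && tar xf /tmp/SECURITY.tar && rm /tmp/SECURITY.tar"
    done
    run ssh "$SOURCE_NODE" rm /tmp/SECURITY.tar
    # 3. Start CAC on every node.
    for node in $SOURCE_NODE $OTHER_NODES; do
        run ssh "$node" /usr/sbin/cacaoadm start
    done
}

sync_cac_keys
```

Reviewing the dry-run output first, and only then setting DRY_RUN=0, keeps the key files from being overwritten by accident.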

 This procedure is explained in detail here. Join our communities around CAC and Solaris Cluster for more.

Changing Sun Cluster Manager port, 6789

There have been requests from people who want to change the port through which Sun Cluster Manager (SCM) is accessed. SCM, like many other web applications from Sun, is accessed through the Sun Java Web Console. By default, Sun Java Web Console is accessed via the secure HTTP port 6789. In fact, the port numbers 6786 to 6789 are assigned to Sun Java Web Console and no other application should use these ports.

Here is a procedure that I used recently to change these ports, if necessary. Maybe it will be useful for others as well.

1. Find out the version of the Sun Java Web Console that you currently have.

    /usr/sbin/smcwebserver -V

2. If the version is 3.0.2, then do the following.

   smcwebserver stop

   cd /var/webconsole/domains

   rm -rf console

   cd /etc/webconsole/console


   rm regcache/


   Replace the values for console_httpsport and console_httpport.

   If on Solaris 10, clear the service:

      svcadm clear system/webconsole:console

   smcwebserver start

3. If the version is greater than 3.0.2, then do the following.

   smcwebserver stop

   /usr/share/webconsole/bin/wcswap -t tomcat -s <nnnn> -p <nnnn>
   If on Solaris 10, clear the service:

      svcadm clear system/webconsole:console

   smcwebserver start
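
As a small safeguard for step 3, the port numbers can be checked before anything is changed. The sketch below only validates two values and prints the commands from step 3 with those values filled in. It does not execute them, the function names are hypothetical, and the ports 8443 and 8080 are example values, not recommendations. The two values are passed to the -s and -p options in the order the procedure lists them.

```shell
#!/bin/sh
# Sketch: validate two port numbers, then print the commands for
# Sun Java Web Console versions greater than 3.0.2. Nothing is executed.

# valid_port N: succeeds when N is an integer between 1 and 65535.
valid_port() {
    case $1 in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$1" -ge 1 ] && [ "$1" -le 65535 ]
}

# print_port_change S_PORT P_PORT: S_PORT and P_PORT are the values for
# the -s and -p options of wcswap.
print_port_change() {
    valid_port "$1" && valid_port "$2" || { echo "invalid port" >&2; return 1; }
    echo "smcwebserver stop"
    echo "/usr/share/webconsole/bin/wcswap -t tomcat -s $1 -p $2"
    echo "svcadm clear system/webconsole:console   # Solaris 10 only"
    echo "smcwebserver start"
}

print_port_change 8443 8080   # hypothetical port values
```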