Wednesday, April 11, 2012

A few essentials to focus on

Dr. Neil Gunther has brought out clearly, in a series of articles on his blog, the importance of the stretch factor (R/S, response time over service time) alongside the OS run queue and CPU utilization.
In fact, there have been surprisingly similar approaches elsewhere: a book on queueing theory by another gentleman, Dr. Leonid Grinshpan, mentioned elsewhere, recently caught my attention. All this is good and reinforces the concepts clearly in my mind.

But I do find Dr. Neil Gunther's blog articles very clear and to the point. (Being a mathematician who has done some thesis work on probability, I am usually quick to catch anything on queueing theory with math involved, helped by my technical background in telecom and database systems.)

For a performance consultant, the following checklist of items may be useful:

Whenever your scalability/performance test workloads clearly stretch some of the resources in your setup (viz. CPU and/or disk storage) so that the stretch factor (R/S) goes well beyond acceptable SLA values, it is time to stop and think a little along the following lines:

i) Do a very quick check of the hardware/software setup to spot low-hanging fruit. This need not be too invasive, and some things, such as centralized storage (NAS/filer storage, if you are using it) or your server's internal bus bandwidth, may be beyond your reach. All you can do is briefly document what you see and move on: Is it a single-headed NAS with the NVRAM write cache on? How many LUNs did you use? How were the LUNs carved out of the filesystem at the filer end?
If you don't have clues or answers, do not worry; move on. You can assume that storage following the latest trends should behave like reading from memory, limited by the network topology to the storage (software/hardware adapters, HBA, NIC, mode of transport, congestion, etc.). Of course, DB workloads are a little tricky, as some of them have subtle dependencies on storage; more on that later.

Also, do not worry if the stretch factor (R/S) is too large to be accounted for by a single queued resource. This only indicates that, in addition to that resource, another resource or network traffic congestion is coming into play. This is where you come to understand that you can add further stages to the model of your transaction.
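To make the "add another stage" idea concrete, here is a minimal sketch in Python. It assumes each stage behaves like a simple M/M/1 queue, so its residence time is S / (1 - utilization); the function names, service times and utilizations are made up for illustration and are not from any measurement.

```python
def mm1_residence(service_time, utilization):
    """Residence time of one M/M/1 stage: R = S / (1 - rho)."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return service_time / (1.0 - utilization)


def stretch(stages):
    """Stretch factor R/S of a transaction that visits every stage once.

    stages is a list of (service_time, utilization) pairs, one per stage.
    """
    total_service = sum(s for s, _ in stages)
    total_residence = sum(mm1_residence(s, u) for s, u in stages)
    return total_residence / total_service


# Hypothetical numbers: 10 ms of CPU at 50% busy, 10 ms of disk at 75% busy.
cpu_only = [(0.010, 0.50)]
cpu_and_disk = [(0.010, 0.50), (0.010, 0.75)]

print("one stage :", round(stretch(cpu_only), 2))
print("two stages:", round(stretch(cpu_and_disk), 2))
```

If the measured R/S is well above what the one-stage model gives, adding the second stage (disk, network, etc.) and seeing whether the model then matches is a quick way to check which resource is missing.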

ii) Having done a quick check of the CPU subsystem, network, storage and memory, consider the modelled application workload carefully to see whether any part of it can be safely turned off. (Ideally you would like to have no more than 2 transaction types mixed.) Tuning away unnecessary application workload, viz. bugs in software that put an extra burden on resources, is the most beneficial step, and one you can work on with development. This means that avoiding some fired SQL/PL/SQL units, network round trips and meta SQLs in transactions would bring a greater advantage.

iii) A very important side effect when the CPUs in one of your servers get stretched beyond the knee (for M/M/2 it is roughly 0.65 or 65%, for M/M/4 it is 0.8; I need to re-verify these values, as I am writing them quickly off the top of my head) is that the OS/kernel may take your application or backend/DB off the CPU for a brief period while it is operating way beyond the knee CPU utilization. Check whether this happened at some point; you can spot it easily with any decently sampled OS monitoring tool. For example, I have seen Linux kernels, under sustained >80% CPU usage, sometimes take the application/DB off the CPU. This is usually a bug in the OS kernel.
Overall, you want to be on the CPU always to be winning.
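The M/M/m figures quoted above from memory can be re-checked with a small sketch like the following. It uses the standard Erlang C formula for an M/M/m queue to compute the stretch factor R/S at a given per-CPU utilization; the function names are hypothetical, and where exactly you place the "knee" on the resulting curve is a judgment call this sketch does not settle.

```python
from math import factorial


def erlang_c(m, rho):
    """Probability that an arriving request has to queue in M/M/m.

    m is the number of servers (CPUs), rho the per-server utilization.
    """
    if m < 1 or not 0.0 <= rho < 1.0:
        raise ValueError("need m >= 1 and 0 <= rho < 1")
    a = m * rho  # offered load in Erlangs
    below = sum(a ** k / factorial(k) for k in range(m))
    top = a ** m / factorial(m) / (1.0 - rho)
    return top / (below + top)


def stretch_factor(m, rho):
    """R/S for M/M/m: one service time plus the mean wait, in units of S."""
    return 1.0 + erlang_c(m, rho) / (m * (1.0 - rho))


# Tabulate R/S for 2 and 4 CPUs over a range of utilizations.
for m in (2, 4):
    for rho in (0.50, 0.65, 0.80, 0.90):
        print(f"M/M/{m}  rho={rho:.2f}  R/S={stretch_factor(m, rho):.2f}")
```

Reading the table against your SLA's acceptable R/S gives the utilization beyond which you should not run each configuration.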
