Friday, April 05, 2013

Retrograde Throughput factors

Trying to keep posts without real data as short as possible, here is a quick update on the main factors that can cause retrograde throughput in load tests, especially at the db layer, which are worth checking quickly with the aid of various tools as the case demands.
Main causes:
  • Unix/OS syscall code path
    • Recent 64-bit OS versions on x86 hardware have improved/faster syscall entry into kernel mode, which addresses this to some extent.
  • CPU Cache to Cache communication
    • This is where the hardware coherency factor shows up, e.g. CPU cross-calls and CPU interconnects play an important role here.
  • Main memory to Cache data movement/transfer
    • Cache misses and memory stalls play an important role here. Tools such as cpustat, DTrace (cpc probes/PAPI metrics) and Linux perf hardware events can aid the analysis. Measuring CPI (cycles per instruction) is a useful metric, along with your critical code path length.
  • Waiting on I/O
    • Waiting for an I/O DMA request to complete.
  • Spinning on a latch/mutex
    • Adaptive/ticketed spinlocks used to acquire a mutex play an important role here.
Sometimes no single tool or methodology will work if you want to arrive at a logical conclusion; measuring the important metrics is key in analyzing issues such as why throughput falls off, or becomes so badly retrograde against expectations, in a given setup.
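As a trivial sketch of the CPI metric mentioned above (the counter values here are hypothetical, standing in for what cpustat or `perf stat` would report):

```python
def cpi(cycles: int, instructions: int) -> float:
    """Cycles per instruction; on a superscalar x86 core a CPI well
    above ~1 usually means the pipeline is stalled on memory rather
    than doing useful compute."""
    return cycles / instructions

# Hypothetical counters for one run of the critical code path:
print(cpi(cycles=12_000_000_000, instructions=4_000_000_000))  # 3.0
```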

Saturday, January 05, 2013

Exploring Parallel Processing with Oracle-part 1

There is a lot of web content on parallel processing in general out there, and I just wanted to post some useful links/references related to the Oracle database and PL/SQL in general (I am not touching Exadata or other Oracle Fusion platform technologies here) as an introduction/quick reference.
The main motivating factor will of course depend on the use case you have in hand, and a correct understanding of your existing codebase/application helps you see where you stand with respect to your performance/scalability requirements.
  • There are two basic options for parallel processing with Oracle, in particular with PL/SQL.
    • You can use the Parallel Query (PQ) feature of Oracle. This parallelises SQL by breaking large scans into a number of smaller scans and running them in parallel. You can also run PL/SQL via PQ by defining a parallel-enabled PL/SQL pipelined table function.
    • The second option is rolling your own parallel processing in PL/SQL. With 11g you can use DBMS_PARALLEL_EXECUTE. You can use DBMS_JOB to run parallel processes, message queues and database pipes (or even plain SQL tables) for IPC, and DBMS_LOCK to implement semaphores and mutexes.

  • Also factor in your main motivation for parallel processing, i.e. Speed-up or Scale-up:
    • SpeedUp means: if your current single-threaded process takes time T(1), which you find unacceptable, you can speed it up with "p" processes/threads so that the new, improved time is T(p). The speedup S(p) = T(1)/T(p) has an upper bound, called the Amdahl bound, of 1/sigma as p increases; this is the familiar diminishing returns for higher values of p. Here "sigma" is the serial fraction of your current workload T(1), i.e. T(1) = (1-sigma)*T(1) + sigma*T(1), and S(p) = T(1)/T(p) = p/(1 + (p-1)*sigma). Hence as p -> infinity the speedup curve with p additional threads hits an asymptotic limit of 1/sigma. Note: the sigma portion represents the part of the workload that can never be parallelized, e.g. setup and teardown steps that can run in only one thread.
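A minimal sketch of the Amdahl bound above (the sigma value is just an illustrative assumption):

```python
def amdahl_speedup(p: int, sigma: float) -> float:
    """S(p) = p / (1 + (p - 1) * sigma), where sigma is the serial fraction."""
    return p / (1 + (p - 1) * sigma)

sigma = 0.1  # assume 10% of T(1) is serial setup/teardown
for p in (1, 2, 8, 64, 1024):
    print(p, round(amdahl_speedup(p, sigma), 2))
# the curve flattens toward the asymptote 1/sigma = 10 as p grows
```

Note the diminishing returns: going from 8 to 64 threads here buys less than a doubling of the speedup, even though the thread count grew eightfold.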
    • ScaleUp means: here you have, or foresee, scalability issues in handling a larger workload with your current single-threaded code/process. Say you have a workload input parameter N, with throughput T(N) for that input. You want to scale to higher values of N as close to linearly as possible, i.e. the graph of T(N) vs N should stay as close to a straight line as it can. Though perfectly linear scalability of T(N) vs N is ideal and unachievable, you want to stay close to it. Contention and coherency are the two factors that affect scalability. In the real world your throughput will never increase forever on a given hardware setup, so you are interested in the boundary beyond which throughput falls off, i.e. becomes retrograde. This is something you can advise your customers about: on a given piece of hardware you can go up to this N(max) value.
    • Sometimes you want both SpeedUp and ScaleUp, i.e. you want the batch completion time to stay as close to T(1) as possible even with increasing workloads. This may or may not always be practical; it depends.
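The contention and coherency factors above can be sketched with Dr. Gunther's Universal Scalability Law; the sigma and kappa values below are made-up illustrations, and the peak at N(max) = sqrt((1-sigma)/kappa) marks where throughput turns retrograde:

```python
import math

def usl_throughput(N: int, sigma: float, kappa: float) -> float:
    """Relative throughput C(N) = N / (1 + sigma*(N-1) + kappa*N*(N-1)):
    sigma models contention, kappa models coherency traffic."""
    return N / (1 + sigma * (N - 1) + kappa * N * (N - 1))

sigma, kappa = 0.05, 0.002           # illustrative assumptions
n_max = math.sqrt((1 - sigma) / kappa)
print(round(n_max, 1))               # throughput peaks near this N
for N in (1, 10, 22, 40):
    print(N, round(usl_throughput(N, sigma, kappa), 2))
```

Past N(max), adding more load strictly reduces throughput, which is exactly the retrograde behaviour and the N(max) boundary the text describes.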

General Parallel Processing Patterns
 The above link is a very useful one to look at all kinds of parallel processing patterns at one glance, for interested folks.
1 ) Data Parallelism  In Oracle:
  • It is always advisable to read and understand where SQL-level parallelism via PDML/PDDL can be used in the Oracle EE (Enterprise Edition) database, whenever your use cases demand it. Along with this query-level parallelism, one would also use some sort of parallel procedural processing (if you plan multiple threads of procedural processing) to break your main job/process into sub-jobs/tasks.
Note on how Oracle implements dbms_parallel_execute: from 11gR2 onwards, the parallel framework provided by the dbms_parallel_execute package submits the parallel sub-tasks/jobs via dbms_scheduler job processes, governed by the job_queue_processes init.ora parameter. Prior to 11g one had to write one's own parallel framework (DoItYourselfParallelism, as Tom often calls it) and submit the tasks via the dbms_job package (dbms_job is very similar to dbms_scheduler, though dbms_scheduler is more sophisticated, with better integration with RAC, node affinity, etc.). Loosely, you can refer to the launched slave processes as different threads processing the sub-tasks, though behind the scenes they may be separate OS processes; the implementation is transparent to you, and it is up to Oracle to use processes or threads depending on the particular OS port. On most Unix-like ports (Windows excepted), Oracle uses separate OS shadow processes for the dbms_scheduler jobs spawned for the sub-tasks. Again, this is immaterial/transparent to you, as Oracle gives you control over sub-task status and start/stop mechanisms, along with knobs to adjust the degree of processing you generally need.

Things/factors to consider in the "data parallelism" pattern are the following:
  • Initial setup: what setup is needed to break the data to be processed into buckets that suit your needs? The breakup may need some understanding of the data so that the buckets/sub-groups are balanced, and may need fine-tuning or a finer-grained functional breakup depending on failure-control/balance needs.
  • What is the atomic unit of work in any of your typical iterations/tasks? It could end up being an INSERT/UPDATE, i.e. a DML, or some simple numerical calculation steps in PL/SQL (since we have already assumed the data-parallelism pattern, these calculations would not be compute-intensive or mathematically complex, but rather simple ones).
  • Balance of PQ and parallel processing: how to balance, or rather cleverly mix, parallel DML/DDL into your processing. One thing to note, especially in RAC/cluster database setups, is to watch for inter-node overheads, i.e. limit movement of data over the RAC interconnect so that data stays as local as possible. (I think the dbms_scheduler jobs used by dbms_parallel_execute here would use RAC service names, giving you better control over which nodes participate in your parallel processing; one has to explore this further and test it.)
  • Knobs/end-user interface: some useful terminologies and patterns for reference
    • Oracle Fusion Applications have a UI Design Pattern for this here.
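As a language-neutral sketch of the data-parallelism pattern above, here is Python multiprocessing standing in for dbms_parallel_execute; the chunking scheme and the "DML" per row are hypothetical stand-ins, not Oracle's implementation:

```python
from concurrent.futures import ProcessPoolExecutor

def chunks(ids, size):
    """Initial setup: break the full ID list into balanced buckets."""
    for i in range(0, len(ids), size):
        yield ids[i:i + size]

def process_chunk(bucket):
    """Atomic unit of work per sub-task; a stand-in for a simple DML
    or a light numerical calculation per row."""
    return sum(x * 2 for x in bucket)

if __name__ == "__main__":
    ids = list(range(1, 101))
    # max_workers is the 'degree of processing' knob the text mentions
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(process_chunk, chunks(ids, 25)))
    print(sum(results))  # 10100 == 2 * (1 + 2 + ... + 100)
```

Just as with dbms_parallel_execute, the caller only sees the buckets, the worker count, and the per-bucket results; whether the workers are OS processes or threads is an implementation detail.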

Wednesday, April 11, 2012

Few essentials to focus

Dr. Neil Gunther brought out clearly, in a series of articles on his blog, the importance of the Stretch Factor (R/S) along with the OS run queue and CPU utilization.
In fact there have been surprisingly similar attempts/approaches elsewhere: a book on queueing theory by another gentleman, Dr. Leonid Grinshpan, caught my recent attention. All this is good and reinforces the concepts clearly in my mind.

But I do find Dr. Neil Gunther's blog articles very clear and to the point. (Being a mathematician who has done some thesis work on probability, I am usually very quick to catch anything on queueing theory with math involved, along with my technical background in telecom and database systems.)

For a performance resultant, the following checklist of items may be useful:

Whenever your scalability/perf test workloads clearly stretch some of the resources in your setup, viz. CPU and/or disk storage, so that the Stretch Factor (R/S) goes way beyond acceptable SLA values, it is time to stop and think a little along the following lines:

i) Do a very quick check of the hardware/software setup to spot low-hanging fruit. This need not be too invasive, and some things, such as centralized storage (NAS/filer storage, if you are using it), may be beyond your reach, along with your server's internal bus bandwidth, etc. All you can do is briefly document what you see and move on: is it a single-headed NAS with the NVRAM write cache on? How many LUNs were used? How were the LUNs carved out of the filesystem at the filer end?
If you don't have clues or answers, do not worry; move on. With the latest trends, storage should behave like reading from memory, subject to the limitations of the network topology to the storage (software/hardware adaptors, HBA, NIC, and the mode of transport/congestion, etc.). Of course db workloads are a little tricky, as some of them have subtle dependencies on storage; more on that later.

Also, do not worry if the Stretch Factor (R/S) is too large to be accounted for by a single queued resource. This only indicates that, in addition to that resource, another resource or network traffic congestion is coming into play. This is where you can add further stages to the model of your transaction.
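That stretch factor can be sketched for a single M/M/m queue using the standard Erlang C waiting probability (the utilizations below are illustrative, not measured values):

```python
from math import factorial

def stretch_factor(m: int, rho: float) -> float:
    """R/S for an M/M/m queue (residence time over service time),
    via the Erlang C probability that an arrival has to wait."""
    a = m * rho  # offered load in Erlangs
    erlang_c = (a**m / (factorial(m) * (1 - rho))) / (
        sum(a**k / factorial(k) for k in range(m))
        + a**m / (factorial(m) * (1 - rho))
    )
    return 1 + erlang_c / (m * (1 - rho))  # R/S = 1 + W/S

for rho in (0.5, 0.8, 0.9, 0.95):
    print(rho, round(stretch_factor(1, rho), 2), round(stretch_factor(4, rho), 2))
```

For m = 1 this collapses to the familiar 1/(1-rho), so if the observed R/S is far above what any single resource's utilization can explain, a second queueing stage is indeed the natural suspect.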

ii) Having done a quick check of the CPU, network, storage and memory subsystems, consider carefully whether any part of the modelled application workload can be safely turned off (ideally you would like no more than two transaction types mixed). Tuning away unnecessary application workload, viz. software bugs causing extra burden on resources, is the most beneficial step, and one you can work on with development. Avoiding some fired SQL/PLSQL units, network round trips and meta SQLs in transactions would bring great advantage.

iii) A very important side effect when the CPUs in one of your servers get stretched beyond the knee (for M/M/2 it is roughly 0.65 or 65%, for M/M/4 about 0.8; I need to reverify, as I am writing these values quickly off the top of my head) is that the OS/kernel may at some point take your application or backend/db off the CPU for a brief period while it is operating way beyond the knee utilization. You can spot this easily with any decently sampled OS monitoring tool. For example, I have seen Linux kernels, under sustained >80% CPU usage, sometimes take the app/db off the CPU; this is usually a bug in the OS kernel.
Overall, to be winning you want to be on CPU always.

Saturday, March 10, 2012

Thought process skills for SPE

Dr Neil Gunther rightly mentions that hardware systems are increasingly becoming commodity "black boxes" with a lot of hidden complexity in each subsystem (CPU, memory, disk and network), and the onus is on the performance resultant to understand the application, at least its basics if not fully, for software performance & scalability. William Louth (JXInsight CTO) has also recently been stressing the same indirectly while presenting QoS concepts for applications in the cloud ("the application is the network"). No doubt this means a somewhat steeper learning curve for anyone trying to bring about significant results in performance and scalability in the shortest possible time, but any such application-centric effort (be it in knowledge acquisition or in understanding runtime behavior from a software performance engineering perspective) is more beneficial in terms of cost-benefit analysis. Glass-box, if not white-box, testing holds the key, and the right choice of, and freedom in choosing, tools/models for experienced people saves time and cost in the long run and lets them do the right, relevant work.

Thursday, March 08, 2012

Joy of sharing knowledge and being seen as ignorant

It is always good to share what you know, for sharing allows you to know more, and the knowledge shared grows as everyone contributes to it.
However, sometimes it is best to keep quiet and let others think they know all about what you know during discussions that could otherwise turn into unnecessary debates.
You are in fact doing good by not disturbing/agitating them, allowing them to go away with a peaceful, happy mind.
Moreover, you can also get on with your own thoughts/ideas and what you wanted to do, peacefully, without harming each other's minds.
This is the purest form of non-violence, for I believe that any form of conflict is in essence nothing but violence.
I have learnt from experience that not reacting in such situations does a world of good, rather than trying to unsettle each other's minds.
In these days of too much emphasis on self-managing applications, people do not realize the true potential of self-awareness among humans, and the pattern recognition that humans are still better at than machines.

Tuesday, November 22, 2011

Losing the Big Picture

Sometimes it happens that we follow certain processes/standards so religiously that we lose sight of the big picture, or of the essence/crux of what we intended to achieve in the first place in a timely fashion.

Following processes/standards is good, but it should not hamper, or become a stumbling block to, achieving your principal objectives in time, when they matter the most.

Anyway, this is not something unique to the software industry; it applies to any walk of life in general. From time immemorial, all religions have seen new philosophies emerge within their realm whenever people following the existing ones lose sight of the big picture and become so involved in the processes that they lose the essence/crux.

Thursday, October 20, 2011

Man is a tool-loving animal

The excessive proliferation of tools and technologies in today's software ecosystem (which is also good in some respects, thanks to the open and/or free software movement) illustrates this, and the plethora of seemingly similar tools/technologies in IT can confuse even the best learned. I often find too much engineering where there is hardly any significant difference in functionality and/or performance (scaling is a different phenomenon, with different degrees of expectations and compromises). No wonder: we humans are tool-loving animals, and the fact that no tool is even near perfect enough to accommodate all situations at any given point drives this need.

Five tools that I am personally interested in, in the field of application runtimes/profiling, are:

i) DTrace on Solaris (it may take some time for me to try it out on Linux)
ii) JRockit Flight Recorder, and any plans from the JRockit team to supply a friction-free logging library (PS: I am not starting any raging controversy here on whether logging is good or bad, or on its merits/demerits versus diagnostics)
iii) Azul Systems diagnosis
iv) JXInsight
v) YourKit

(in no specific order, please!)

I haven't used/seen DTrace for Java apps, so I do not know at this stage how it could help lead all the way up to showing a stack trace in JVM/user space, especially in the context of JRockit, where code compilation/conversion differs from the Sun HotSpot JVM, which belonged to Solaris land.

Back in SQL land, I have always maintained that the model is the code in all walks of software. In particular, Dan Tow's diagrammatic way of visualizing a SQL statement to understand whether the CBO really chose the best possible plan (one can argue that this is the CBO's job and ask why we should bother with its internal plan-selection algorithm; trust me, you will need it at some point!) had been running in my mind for quite some time, since 2006. I even wondered why someone didn't take it up and automate it to provide a visual way of looking at things, and then stumbled upon this article, which shows that people do have intentions similar to mine. In software, especially in performance management, there is no revolution, only a constant evolution of ideas and thoughts that drives things.

Converging JVMs and DTrace for Linux

The news is officially out: as expected, the two popular JVMs (Sun HotSpot and JRockit) are being converged, and a DTrace port for Linux has started to mature.
The JVM itself, being a C/C++ runtime, would in my opinion go through some changes in future, especially towards better diagnosis and better integration with the other underlying layers, but any improvements in performance/scalability need to be tested out, as they may not be too clear at this point.