OBIEE 11g Performance Challenge “Ultimate Tuning Guide”


The idea to create this new thread about OBIEE 11g Performance Tuning as it is a real challenge and very few people even the experts have limitation to cater all tuning aspects. This is only because a monster seating in front wheel which is driver of all stuff called “Weblogic” and magnitude of knowledge with evolve of Weblogic 11g has certain  limitations.

We are going to explore nuts and bolts of all possible tuning aspects which essentially can help a lot of people. The scope of this blog considers tuning OBIEE 11g(11.1.1.7 )Linux platform which is more stable and more reliable. We have huge # of customers accessing analytics everyday with high # of queries. The Customer base is huge each day 24×7 accessible across every part of the universe. The reports and pages rendering needs to run very fast without any performance concern at-least that is the expectation. Having this in mind we started tuning 11g platform and performed regression test 5 times in row. Below are the tuning what we did so far in different layers of the application. Overall it was a great testing effort conducted ! We did performance testing with 1000 concurrent user ramped up in the system within 1 minutes. We use performing testing tool to emulate user login using a software and browse through across different dashboard pages and the logout mimicking users activities.

Lets see what are the different layers of Performance tuning we have done on our platform:

  • Operating System(OS) / Hardware Level Tuning

    We have 3 Virtual servers in Horizontal cluster and each of them running only 1 instance. No Vertical clustering. All in Highly available platform with Fail-over and Load balancing capability and has future scalability opportunity. Note those servers are not Physical yet. If you are planning to do Physical that will give more dedicated power and Oracle recommend this but as of now if you plan to do with Virtual it will give you similar capability if you mastered it with proper level of tuning.
  • Assign Proper Hardware resource: Initially we had 3 CPU Cores but with the increase # of loads on CPU we increased that to 8 Virtual CPU’s to balance out the load.  We have Intel(R) Xeon(R) CPU E5-2637 v2 @ 3.50GHz and each server has 128 GB of RAM.
  • Change TCP Max Sync: When the server is heavily loaded or has many clients with bad connections with high latency, it can result in an increase in half-open connections which is called “tcp_max_syn_backlog” . To give CPU enough room to handle too many open files and to avoid OutofMemory error while creating new native threads we have tweaked couple of below parameter in /etc/security/limits.conf  file. Let root edit file to include below: Also check the parameter using ulimit -n and ulimit -u command in Linux.
    * soft nofile 131072
    * hard nofile 131072
    * soft nproc 131072
    * hard nproc 131072
  • Change TCP Fin Timeout: cat /proc/sys/net/ipv4/tcp_fin_timeout  , this will give you the existing #. If the # is 60 make it 30 . Reducing the value of this entry, TCP/IP can release closed connections faster, providing more resources for new connections.
  • You might thing of to change: /proc/sys/net/ipv4/tcp_keepalive_time to be less than 7200. Follow this to make both the changes on TCP level is updated in system:

http://tldp.org/LDP/solrhe/Securing-Optimizing-Linux-RH-Edition-v1.3/chap6sec75.html

  • I have face a  very weird scenarios where we had to touch Cisco router settings to Allow Urgent Flag(Disable Clear Urgent Flag) to have better communication between DB server and OBIEE servers. We had issue where Usage Tracking causing trouble to render one prompts .. ! Sounds weird ! yes its true , it was a connection issue between UT table in DB and UI causing trouble for one of the multi-select prompt never rending and hanging forever. Also I have seen issues where Scheduler process has severe intermittent connection issue  with DB (RCU Schemas). Both of them are addressed by this Cisco Firewall level changes below:
  • Cisco Router Config1 Cisco Router Config2 Cisco Router Config3 Cisco Router Config4
  • Middle Tier (Weblogic EM/Console) Layer Tuning

1)  In EM against data source Increase # of Connection Pool to 150. This includes all the data source added there as per below image. For us we haven’t touch default 50 and you might want to do that. Increasing this value could cause potential impact on Database as it will try to open that many open cursors and processes to the database server.

EMDataSourceConnections EMDataSourceConnections2

2) We have tuned 64 bit JVM’s. In JRockit JVM (R28.x), the heap grows faster than before. The JVM also ensures that the heap size grows up to the maximum Java heap size (-Xmx) before an OutOfMemory error is thrown.So we have tweaked some JVM related parameter under below file:  [MiddlewareHome]/user_projects/domains/bifoundation_domain/bin/setOBIDomainEnv.sh

JVM

3) Tune Analytics (WebLogic Server app plug-in) Connection pool:

There is a connection pool between WebLogic Server analytics app and OBIPS, and the default value is 128 inadequate for a large number of concurrent users which is typically expected in a BIEE system with high users concurrency. When the number of connections reaches the maximum limit, any new requests are kept waiting. Hence, it is recommended to increase this pool to 512 for your BIEE system to support more concurrent users.

I have done parameter change in file :

[Middleware Home]/ user_projects/domains/bifoundation_domain/config/fmwconfig/biinstances/coreapplication/bridgeconfig.properties

Changed: oracle.bi.presentation.sawconnect.ConnectionPool.MaxConnections=777

4) EM STUCK THREAD PARAMETER CHANGES

This is very important as highlighted it in RED. It is also known as HOGGING Thread . In a high user volume testing environment seeing this STUCK Thread issue is pretty common in log files. We need some exceptional level of tuning to get rid of this STUCK thread issue.

WebLogic Server automatically detects when a thread in an execute queue becomes “stuck.” Because a stuck thread cannot complete its current work or accept new work, the server logs a message each time it diagnoses a stuck thread. A thread might get stuck due to various reasons.

For example: When large BI report is running and the time it takes to complete is say 800 seconds, then, as the default stuck thread timing is 600 seconds in WebLogic Server, the thread allocated for that query waits for 600 seconds and goes to stuck state.

As best practice we have changed basic Thread Parameters as below under Console. And this needs to be done all the Managed servers clustered participating nodes  including Admin server.

STUCK1 STUCK2

In our environment we have faced this issue multiple times and we took couple of thread dump when STUCK thread occurs and BI Server stops responding and stop accepting any new connections:

A typical thread dump has below details and observe that there are lot of TIMED WAITING events:

“[ACTIVE] ExecuteThread: ‘136’ for queue: ‘weblogic.kernel.Default (self-tuning)'” RUNNABLE native

java.net.SocketInputStream.socketRead0(Native Method)

java.net.SocketInputStream.read(SocketInputStream.java:129)

com.siebel.analytics.web.sawconnect.SAWConnection$NotifyInputStream.read(SAWConnection.java:165)

java.io.BufferedInputStream.fill(BufferedInputStream.java:218)

java.io.BufferedInputStream.read(BufferedInputStream.java:237)

com.siebel.analytics.web.sawconnect.sawprotocol.SAWProtocol.readInt(SAWProtocol.java:188)

————-

“Timer-1” waiting for lock java.util.TaskQueue@3b92d7d0 TIMED_WAITING

java.lang.Object.wait(Native Method)

java.util.TimerThread.mainLoop(Timer.java:509)

java.util.TimerThread.run(Timer.java:462)

“Timer-0” waiting for lock java.util.TaskQueue@49d85ab9 WAITING

java.lang.Object.wait(Native Method)

java.lang.Object.wait(Object.java:485)

java.util.TimerThread.mainLoop(Timer.java:483)

java.util.TimerThread.run(Timer.java:462)

“Signal Dispatcher” RUNNABLE

null

“Finalizer” waiting for lock java.lang.ref.ReferenceQueue$Lock@67646de5 WAITING

java.lang.Object.wait(Native Method)

java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)

java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)

java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)

“Reference Handler” waiting for lock java.lang.ref.Reference$Lock@5178efd5 WAITING

java.lang.Object.wait(Native Method)

java.lang.Object.wait(Object.java:485)

java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)

“main” waiting for lock weblogic.t3.srvr.T3Srvr@4685b50e WAITING

———————-
“Timer-117” waiting for lock java.util.TaskQueue@46fbadd2 TIMED_WAITING

java.lang.Object.wait(Native Method)

java.util.TimerThread.mainLoop(Timer.java:509)

java.util.TimerThread.run(Timer.java:462)

“Timer-116” waiting for lock java.util.TaskQueue@3550da66 TIMED_WAITING

java.lang.Object.wait(Native Method)

java.util.TimerThread.mainLoop(Timer.java:509)

java.util.TimerThread.run(Timer.java:462)

“Timer-115” waiting for lock java.util.TaskQueue@4f3279e2 TIMED_WAITING

java.lang.Object.wait(Native Method)

java.util.TimerThread.mainLoop(Timer.java:509)

java.util.TimerThread.run(Timer.java:462)

“Timer-114” waiting for lock java.util.TaskQueue@7ae00d0c TIMED_WAITING

java.lang.Object.wait(Native Method)

java.util.TimerThread.mainLoop(Timer.java:509)

java.util.TimerThread.run(Timer.java:462)

“Timer-113” waiting for lock java.util.TaskQueue@b78cdda TIMED_WAITING

java.lang.Object.wait(Native Method)

java.util.TimerThread.mainLoop(Timer.java:509)

java.util.TimerThread.run(Timer.java:462)

“Timer-112” waiting for lock java.util.TaskQueue@2812a918 TIMED_WAITING

java.lang.Object.wait(Native Method)

java.util.TimerThread.mainLoop(Timer.java:509)

java.util.TimerThread.run(Timer.java:462)

“QuartzScheduler_BIPublisherScheduler-NON_CLUSTERED_MisfireHandler” TIMED_WAITING

————

See more on here if you are interested to trace that down:

https://blogs.oracle.com/WebLogicServer/entry/analyzing_a_stuck_weblogic_execute

A very useful way to find the STUCK Threads are: go to Console -) Environment -) Servers -) Click Each Managed server as bi_serverx -) Click Monitoring -) Threads .

Below are 3 important parameters which tell you the server health plus Hogging thread count . If you see Hogging thread count is potentially high then you are in trouble with STUCK Thread . This below snapshot is very ideal managed server nodes with no Hogging/Stuck thread at all.

Hogger1

However this is most notorious STUCK thread issue . See the Count of thread waiting here. You must be in trouble in such scenarios:

Console

Memory consumption will tell you no issue with CPU / RAM . See this for 3 clustered nodes. Each has 128 GB memory though we are using hardly 15% of it and CPU is 50% free almost every time. So its not an Hardware bottleneck for sure. Network bottleneck ??? May be ! right now in our optical fiber network channel I am assuming its not an issue.

Hogger2

When you see Hogging thread count exceeds 20+ you will get below kind of error in dashboard page for all sessions already In and no new sessions will be able to open connection and will throw login error:

Hogger3

Now see the way you can get details of which request is Hogging . It will not tell you exact dashboard and report name but will tell you which application part/URI request at very high level. When more and more such request piled up in Middle-tier it will results a unresponsive server and application will not be able to respond which potentially leads to server crash.

Usage Tracking Hogging

More such requests you can find here . All became STUCK or HOGGING thread .

Hogger4

See JDBC Data source performance as well:

JDBC Connections usages 0 JDBC Connections usages 1Below shows 2 nodes of 4 node Exadata RAC performance so STUCK Thread is not a bottleneck due to Database and very less usage/load has been observed in DB nodes.

Exadata

HOW TO TRACE STUCK THREAD and SEE where is the bottleneck:

Always leverage EM to see the performance in sessions plus performance in GET and POST request.

server loads3

Use WLS EM to find out the load of various components:

WLS Load

Usage EM Agency Peak Load Agency Peak Load-2 EM Usage - Metric Palette

Try to use Performance URL like below to get the online details during the regression test execution  http://<server:port>/analytics/saw.dll?Perfmon

Load this , analyzed this to get some alarming #’s which will tell you precisely where the problem lies.

Perfmon - Query parameters

Check this link which will tell you how a standard Perfmon will be looks like: RATE-After Final Perf Test

But at  this point you have to understand which parameter is bottleneck and its not an easy job!

Note that, so far whatever tuning we did, it was not enough for the excellent performing environment.

So we did couple of other level of tunings in RPD layer and BI Config layer so that we can best use of the change we did in Weblogic layer.

  • OBIEE (RPD/Catalog/Config file) Level Tuning

     1) BI Server Config (The Blue one below is very important to get rid of THREAD STUCK issue)

[SERVER]

READ_ONLY_MODE = NO; # This is for both online & offline – This Configuration setting is managed by Oracle Enterprise Manager Fusion Middleware Control
MAX_SESSION_LIMIT = 2000;
MAX_REQUEST_PER_SESSION_LIMIT = 5000;
SERVER_THREAD_RANGE = 100-1000; # 40-100 was default
SERVER_THREAD_STACK_SIZE = 1 MB; # default is 256 KB (32 BIT mode), 1 MB (64 BIT mode), 0 for default
DB_GATEWAY_THREAD_RANGE = 40-1000; #40-200 was default
DB_GATEWAY_THREAD_STACK_SIZE = 1 MB; # default is 256 KB (32 BIT mode), 1 MB (64 BIT mode), 0 for default
HTTP_CLIENT_THREAD_RANGE = 0 – 100;
HTTP_CLIENT_THREAD_STACK_SIZE = 1 MB; # default is 256 KB (32 BIT mode), 1 MB (64 BIT mode), 0 for default
MAX_EXPANDED_SUBQUERY_PREDICATES = 8192; # default is 8192
MAX_QUERY_PLAN_CACHE_ENTRIES = 5000; # default is 1024
MAX_QUERY_PLAN_CACHE_ENTRY_SIZE = 1 MB; # default is 256 KB,(32 BIT mode), 1 MB (64 BIT mode), 0 for default
MAX_DRILLDOWN_INFO_CACHE_ENTRIES = 1024; # default is 1024
MAX_DRILLDOWN_QUERY_CACHE_ENTRIES = 1024; # default is 1024
INIT_BLOCK_CACHE_ENTRIES = 500; # default is 20

GLOBAL_CACHE_STORAGE_PATH = “/obiagency360/Shared/Cache” 2 GB; # This Configuration setting is managed by Oracle Enterprise Manager Fusion Middleware Control
MAX_GLOBAL_CACHE_ENTRIES = 1000;
CACHE_POLL_SECONDS = 300;
CLUSTER_AWARE_CACHE_LOGGING = NO;

 2) BI Presentation Server Config

<ThreadPoolDefaults>
<ChartThreadPool>
<MinThreads>100</MinThreads>
<MaxThreads>400</MaxThreads>
<MaxQueue>2048</MaxQueue>
</ChartThreadPool>
</ThreadPoolDefaults>

<Cache>

<ACLs>
<Enabled>false</Enabled>
</ACLs>

<CatalogXml>
<!– Remove from the cache everything older than N minutes –>
<MaxAgeMinutes>240</MaxAgeMinutes>
<MaxLastAccessedSeconds>14400</MaxLastAccessedSeconds>
</CatalogXml>
<Query>
<MaxEntries>5000</MaxEntries>
<!– AbsoluteMaxEntries is the enforced maximum number of entries. When this maximum is reached –>
<!– subsequent queries will fail until the maximum is no longer exceeded. –>
<AbsoluteMaxEntries>20000</AbsoluteMaxEntries>
<!– CruiseEntries is amount of entries the OracleBI Presentation server tries to maintain in its cache. –>
<CruiseEntries>3000</CruiseEntries>
<!– Forces the cache to attempt to remove an old entry when MaxEntries is exceeded. –>
<ForceLRU>true</ForceLRU>
</Query>
</Cache>

 3) Chart config [INSTANCE_HOME/config/OracleBIJavaHostComponent/coreapplication_obijh1/config.xml]

<JobManager>
<MinThreads>100</MinThreads>
<MaxThreads>200</MaxThreads>
<MaxPendingJobs>200</MaxPendingJobs>
</JobManager>

 4) RPD (Repository) Init block config

  • As we know that Starting point of OBIEE Authentication is “Authentication Init block”. We have found minor lags on query performance because it was taking more than expected time and ~7-8 seconds. Since our Init block query is heavy with almost 20 joins the concurrent user login could cause a issue for us. So we moved the entire query to Database procedure , use TABLE functions and pass the :USER to query only and get the variables populated by the returned value of the functions. We have seen 50% performance improvement in this process as it was leveraging the DB cache feature a great extent
  • During 10g to 11g upgrade RPD connection pools are not changed. However best practice is to separate connection pools for init block and connection pools used for Datawarehouse query. So we have segregated one connection pool to two and then assign 30% of Datawarehouse connection pool to Init block totaling ~ 760 connections to 1 RPD spread across 5 databases.
  • Removed unnecessary init block queries. Cleaned up errors in Init block query and removed unused variables plus connection pools
  • Apply patch at server level for “ADF Warning” which will reduce ADF warning in logs
  • Make LOGLEVEL for users not more than 0.
  • Web/Front Layer Tuning

    Bandwidth Savings: 1) Enabling HTTP compression can have a dramatic improvement on the latency of responses. By compressing static files and dynamic application responses, it will significantly reduce the remote (high latency) user response time.
    2) Improves request/response latency: Caching makes it possible to suppress the payload of the HTTP reply using the 304 status code. Minimizing round trips over the Web to re-validate cached items can make a huge difference in browser page load times.

  • This uses a web accelerator mechanism to compress HTTP files (static and dynamic content) and add caching mechanism to help rendering the UI faster. Also in URI and Content compressor the packets transmits the web reduced to 10x in number of bytes and increase the usability of network traffic and faster hand-off between clients and server request.  See this and learn more about HTTP Compression:

https://support.f5.com/kb/en-us/products/wa/manuals/product/wa_implementations_11_0_0/3.htm

Attached below which will tell how much % we gain due to HTTP Compression and we are avoiding misuse of network bandwidth.

HTTP Compress

Also it is advised not to clear browser cache each time after closing and reopening of browser. Make below changes in browser settings:

Increase the cache size to 1024 MB
o Firefox: Enter “about:config” as the url and change:  browser.cache.disk.capacity to 1024000 ,  browser.cache.disk.max_entry_size to -1
o IE: Set “Disk space to use” to 1024 under Internet Options -> Browsing history -> Settings

More details of issues can be found on deciphering below Log files as additional measure of Performance tuning…

  • Presentation Services Log (saw.log).
  • BIServer Log (nqquery.log, nqserver.log).
  • Scheduler Log (nqscheduler.log).
  • JavaHost Log (jh.log).
  • Cluster Controller Log (nqcluster.log).
  • WLS Managed Servers Log (AdminServer-diagnostic.log, AdminServer.log,
    bi_server1.log, bi_server1-diagnostic.log).

I hope this will help a lot and enjoy performance tuning 🙂 😀

You can refer Vishal’s blog which has handful of tuning details specially on DB config parameters and how to tweak them  :

http://obiee-oracledb.blogspot.com/2012/05/obiee-performance-imporvement-tips.html

Another very useful information available here on Performance topic: http://www.clearpeaks.com/blog/oracle-bi-ee-11g/obiee-11g-tuning-and-performance-monitoring