Life of an OBIEE Catalog Object Permission – A Niche Subject: the “ACL” Cache


Access Control List (ACL) is a very common term and is not confined to OBIEE; many software products use ACLs to define security permissions on objects. An ACL specifies which users or system processes are granted access to objects, as well as what operations are allowed on those objects. We see ACLs whenever we edit an object's permissions or open a catalog object using Catalog Manager. I noticed the “Preserve ACL” option in the OBIEE 10g Catalog Manager, used when applying permissions across objects. ACLs turned up again in 11g while we were embarked on a very strange, weird and complicated issue.

We have OBIEE 11.1.1.7 (build 140415) running in production, five months since launch. Recently a couple of customers raised concerns about page visibility across different kinds of subscriptions. A customer can hold multiple subscriptions, switchable from a portal, and based on that switch specific dashboard pages should become visible. However, we experienced an issue where a user logged in and saw Dashboard A, then logged out and logged in again under a different subscription, and Dashboard A was still visible when Dashboard B should have been.

How would you initially attack this problem? I know what you are thinking, but we have a very complicated system with several layers before you land on a dashboard page. We have one portal which redirects to an IdP (performing Ping authentication), then control goes to horizontal cluster nodes via a load balancer, and then control passes to Analytics. So there are three hops before you land on an OBIEE page. On top of this we run the deprecated security model driven by initialization blocks. I know this is bad and not what Oracle recommends, but we have to live with it because Oracle said it would still be functional. So, in summary, we have no Application Roles/Groups set up in EM or Console, and all object-level permissions defined in the catalog are legacy Catalog Group permissions.

So there were various aspects to verify carefully before attacking this problem. I started debugging from the catalog object permission level first; nothing seemed weird there, and a little tweaking here and there didn't resolve the issue.

We thought it could be an SSO (Ping) authentication issue bypassing the RPD security, so that the GROUPS variable was not being populated the right way. So we added the WEBGROUPS and ROLES system variables in the RPD and enabled every level of trace in EM and Console on the Oracle BI security objects, including the most granular log-level trace in the instanceconfig.xml file.
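For reference, Presentation Services trace filters live under the Logging element in instanceconfig.xml. A minimal sketch, assuming the standard FilterRecord syntax; the path shown (saw.securitysubsystem) and the severity levels are illustrative, so check the Oracle documentation for the exact subsystem you need to trace:

<ServerInstance>
  <Logging>
    <Log>
      <Filters>
        <!-- Illustrative: verbose trace for the security subsystem only -->
        <FilterRecord writerClassGroup="File" path="saw.securitysubsystem"
                      information="32" warning="32" error="32" trace="32" incident_error="32"/>
      </Filters>
    </Log>
  </Logging>
</ServerInstance>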

We found that after logging out of the initial session and logging in again, the new session was not picking up permissions correctly; it retained the old permissions on the object.

There is security masking logic on the Read, Write and All permissions at each level of a catalog object, and it is traceable in the sawlog file, which clearly depicted the issue! But unfortunately we still had no smoking gun to resolve the problem.

A search on OTN and Googling around didn't seem to match the right use case! This was a frustrating issue which bothered us for four days until the actual fix was found. Finally we isolated our environment from SSO, the cluster and the load balancer to segregate the root cause of the issue, and we came to know it was an OBIEE 11g issue for sure. We tried tweaks in several areas, mostly at the file-system level: cleaning cache, cleaning volatileuserdata, removing cacheduserinfo, removing that user's content and re-instating it, GUID refresh; nothing worked, and we were almost out of ideas about what to do next 😦 😦

FINALLY an idea triggered in my mind: is there any configuration at the refresh level which could refresh dashboard objects, including their permissions, and so resolve the issue? And I realized instanceconfig.xml is the only place where you can declare such a parameter. So, in search of some parameter, I found the below:

<Cache>
  <CatalogAttributes>
    <MaxAgeMinutes>1</MaxAgeMinutes>
    <CleanupFrequencyMinutes>1</CleanupFrequencyMinutes>
  </CatalogAttributes>
</Cache>

This reloads catalog object permissions at a one-minute interval, so potentially it could fix my issue.

And voila… it fixed my problem… The right user now got the correct permissions after a multiple-subscription switch. But it was still not the end of the game! WHY? Yes, we saw catalog objects refreshing at a one-minute interval, but my problem recurred if multiple logins/logouts happened within that one minute…. Sad!! So this was not going to be the solution. Also, reloading catalog objects frequently is not a good idea, as the SAW server does hard I/O work each time this happens, crawling across the catalog to refresh object permissions. And no instanceconfig.xml parameter exists to lower the interval down to the seconds level 😦

Adding this parameter is as good as a “Reload server metadata” triggered by an admin from the Analytics Administration link. So I went in search of how to invoke this automatically rather than on demand.

Investing more energy in that revealed what this object permission is: it seems closely related to the ACL, and we came to know the details below of how OBIEE handles users in the catalog:

The catalog is designed to scale to thousands of concurrent users. To achieve this scaling, the catalog adheres to the following guidelines:

  • The average user typically only reads from the catalog and rarely, if ever, writes to it. In Release 11g, each user is constantly and automatically updating his or her Most Recently Used file, but each user’s “read” operations still far outweigh the user’s “write” operations. Therefore, the read-to-write ratio is typically at least 100 to 1.
  • While a locking mechanism guarantees that only one user can write to an object at a time, it is rare for multiple users to attempt to write simultaneously to the same object. A feature called “lazy locking” allows users to continue reading an object even when another user is updating that object.
  • Modern file systems cache “small” files directly inside the directory record, such that reading any information on a directory simultaneously loads all small files directly into the operating system’s memory cache. Therefore, it is good practice to keep files in the catalog “small,” especially the frequently “read” .atr metadata files. When these metadata files remain small, then all the .atr files in a directory are loaded into memory with one physical hard disk read. Every file that exceeds the “small” threshold adds another physical hard disk read, which can cause a 100% degradation for each large file. In other words, use care when considering storing arbitrary “Properties” in .atr files.
  • Reading an object’s .atr metadata file using NFS is far slower than reading it directly from a local disk. For this reason, Presentation Services additionally caches all .atr files internally. This cache can become briefly “stale” when another node in the cluster writes data to the file that is newer than the data that is cached by the current node. Therefore, all nodes are refreshed according to the MaxAgeMinutes element in the instanceconfig.xml, whose default for a cluster is 5 minutes. This default setting commonly achieves the best trade-off between the possibility of stale data and the known performance impact. (The default for an environment without clusters is 60 minutes.)

Ours is a pretty similar configuration: millions of .atr files (as we have a very large user base in production) on NAS storage, continuously accessed by the sawserver catalog crawler.

Digging further into ACLs on OTN revealed a similar (though not identical) kind of issue, and I came to know about a bunch of very important config parameters, listed at the end of this blog.

Now I had a full-fledged smoking gun which could potentially resolve my issue, which is below:

<ServerInstance>
  <Cache>
    <ACLs>
      <Enabled>false</Enabled>
    </ACLs>
  </Cache>
</ServerInstance>

So I applied it in instanceconfig.xml to see the results, and unfortunately sawserver would not start, as this is an unrecognized tag! When I read OTN carefully, I found this can be added only from version 140715 onwards, not earlier. So we asked Oracle to provide a patch for 140415. After following up with Oracle on a Sev 1 issue with several escalations, we finally got the fix. We applied it, added these parameters in instanceconfig.xml, and it resolved the issue…. 🙂 🙂 Oracle provided the fix as three files: one XSD and two sawserver binaries (.so files). So the long-running nightmare is over!!! 🙂 I am sooo happy…

So NOW we know what the ACL cache is, how it can behave weirdly and cause a lot of trouble in our lives, and what the way is to get rid of the problems around it. See below some of the other important references on Cache elements in the instanceconfig.xml file. The recommendation is to disable the ACL cache, which Oracle confirmed has no performance impact.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Use these judiciously, only if you know what each one is for, so you don't burn your hands later… 😉

<Cache>
  <UserPopulationExternalMembershipsCache>
    <CleanupFrequencyMinutes>3</CleanupFrequencyMinutes>
    <MaxAgeMinutes>3</MaxAgeMinutes>
  </UserPopulationExternalMembershipsCache>

  <UserPopulationAccountDetailsCacheByGUID>
    <CleanupFrequencyMinutes>3</CleanupFrequencyMinutes>
    <MaxAgeMinutes>3</MaxAgeMinutes>
  </UserPopulationAccountDetailsCacheByGUID>

  <UserPopulationAccountDetailsCacheByName>
    <CleanupFrequencyMinutes>3</CleanupFrequencyMinutes>
    <MaxAgeMinutes>3</MaxAgeMinutes>
  </UserPopulationAccountDetailsCacheByName>

  <RoleDirectMembersCache>
    <CleanupFrequencyMinutes>3</CleanupFrequencyMinutes>
    <MaxAgeMinutes>3</MaxAgeMinutes>
  </RoleDirectMembersCache>

  <RoleHierarchyCache>
    <CleanupFrequencyMinutes>3</CleanupFrequencyMinutes>
    <MaxAgeMinutes>3</MaxAgeMinutes>
  </RoleHierarchyCache>

  <ParentRolesForGrantedRolesCache>
    <CleanupFrequencyMinutes>3</CleanupFrequencyMinutes>
    <MaxAgeMinutes>3</MaxAgeMinutes>
  </ParentRolesForGrantedRolesCache>

  <!-- This cache is NOT related to the security stack. It is only for the account info in the webcat itself. -->
  <Accounts>
    <CleanupFrequencyMinutes>3</CleanupFrequencyMinutes>
    <MaxAgeMinutes>3</MaxAgeMinutes>
  </Accounts>

  <!-- This cache is NOT related to the security stack. It is only for the account info in the webcat itself. -->
  <AccountIndex>
    <CleanupFrequencyMinutes>3</CleanupFrequencyMinutes>
    <MaxAgeMinutes>3</MaxAgeMinutes>
  </AccountIndex>

  <!-- The CatalogAttributes cache contains all webcat objects' metadata (owner, permissions, properties, attributes etc.).
       The internal permissions have their own cache of who is allowed to access each webcat object.
       So to clean up the internal permissions cache, this CatalogAttributes cache must also be cleaned. -->
  <CatalogAttributes>
    <CleanupFrequencyMinutes>3</CleanupFrequencyMinutes>
    <MaxAgeMinutes>3</MaxAgeMinutes>
  </CatalogAttributes>
</Cache>
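For context, all of these Cache elements nest under the ServerInstance root of instanceconfig.xml. A minimal combined skeleton, pairing the ACL fix with the CatalogAttributes cleanup described in the comment above (the interval values are illustrative, not a recommendation):

<ServerInstance>
  <Cache>
    <ACLs>
      <Enabled>false</Enabled>
    </ACLs>
    <CatalogAttributes>
      <CleanupFrequencyMinutes>3</CleanupFrequencyMinutes>
      <MaxAgeMinutes>3</MaxAgeMinutes>
    </CatalogAttributes>
  </Cache>
</ServerInstance>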

Caching – A Cacophony in OBIEE Tuning!!!


Well! A very interesting and controversial topic indeed, and a typical one for group discussions. OBIEE is not yet very popular in the OLAP caching arena, and there is a hell of a lot of debate humming around BI Server caching and, consequently, its performance-tuning aspects. There are real benefits achievable in the performance arena, though at the cost of a couple of trade-offs. At the end of the day the basic rule is: do enable caching, but do it judiciously and make sure it does not get you into trouble later. So configure caching by weighing the substantial benefits against the trade-offs. Here I focus only on OBI Server (SAS) level caching rather than OBAW Server (SAW) caching.

Introduction

OBI can cache query results, so that previously processed requests do not pass through to the database; they are served from the filesystem instead.

Caching Framework

(Figure: OBIEE caching concept)

Benefits

Caching provides significant performance improvements by freeing up database resources for other tasks, so the DB can entertain other sets of user queries.

Query processing against the native filesystem retrieves results faster than to-and-fro communication with the database over the network channel.

Moreover, it conserves network resources by avoiding connections to the database server, so noticeably less network traffic is generated.

It also reduces the BI Server engine's processing overhead across queues of user requests.

However, all of the above is easy to appreciate but difficult to manage: providing the flexibility of faster responses can leave users with an inconsistent state of data if it is not properly managed and planned.

This management and planning has several glitches and many dependencies; it is called Cache Purging, as opposed to its counterpart, Cache Seeding.

Cache Management Techniques

Change the cache's configurable parameters first (see the NQSConfig.INI sketch below). Note that if the disk space for cache files exceeds its limit, the least recently used (LRU) entries are discarded and replaced automatically to make room for new entries. The techniques mentioned below can be broadly distinguished as manual, automatic and programmatic ways to manage caching:
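A hedged sketch of the relevant [CACHE] section of NQSConfig.INI; the path and limits are illustrative, and in 11g several of these settings are centrally managed through Enterprise Manager, so confirm before hand-editing the file:

[CACHE]
ENABLE = YES;
DATA_STORAGE_PATHS = "/obi/cache" 500 MB;
MAX_ROWS_PER_CACHE_ENTRY = 100000;
MAX_CACHE_ENTRY_SIZE = 1 MB;
MAX_CACHE_ENTRIES = 1000;

Once MAX_CACHE_ENTRIES is exceeded, the LRU replacement mentioned above kicks in, and DATA_STORAGE_PATHS can list multiple mountpoints to spread the I/O.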

Cache Seeding

There are several means to do this business :

1) Set the global cache parameter on – this caches queries run against any physical table. By default, every table in the repository is cacheable.

2) Switch on the Cacheable property – this provides a table-level benefit and the extra customisation of deciding which physical tables participate in generating the query cache. E.g., sometimes a user is more interested in caching a giant fact table rather than tiny dimension tables.

3) Scheduling an iBot – an iBot can be configured and used for cache-seeding purposes. It silently builds the cache without any manual intervention, possibly triggered in a time window after the daily ETL load finishes, or customised and automated further based on the result retrieved from another iBot in a chained request. The second iBot should ping the DB to identify whether a database update has been done (i.e. the ETL has finished) before triggering its chained counterpart. This builds the data and query cache for a dashboard request, not for the entire set of tables.

4) Running the nQCmd utility:

Another fantastic way to handle query cache seeding, which avoids any dependency on iBot scheduling time. The overhead is accumulating the superset of the logical queries actually fired for your requests/reports and putting them in a single file to pass as a parameter to nqcmd. This needs to be invoked after the ETL run, ideally by the ETL job itself, e.g. via a remote login to the BI server that triggers nqcmd automatically; a sketch follows below.
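A minimal sketch of this approach, assuming a BI Server DSN named AnalyticsWeb; the logical SQL, credentials and paths are hypothetical. Contents of seed_queries.txt (logical SQL statements matching your report queries, each terminated with a semicolon):

SELECT "Time"."Fiscal Year" saw_0, "Facts"."Revenue" saw_1 FROM "Sales" ORDER BY saw_0;

Invocation after the ETL run (e.g. from the ETL job itself):

nqcmd -d AnalyticsWeb -u weblogic -p Password01 -s /scripts/seed_queries.txt -o /scripts/seed_run.log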

Cache Purging

A very important and crucial mechanism, which has to work perfectly to make caching a success story:

1) Manual purging – usually a dedicated, dumb administration job, and something of an overhead for a company. It can be done simply by deleting the existing cache .TBL files or by firing a purge from the BI Admin Tool. The purge can be done by category (repository, subject area, user, or physical table) from the Admin Tool in online mode.

2) Calling ODBC extensions – bundled ODBC extension functions like SAPurgeCacheByDatabase(), SAPurgeCacheByQuery(), SAPurgeCacheByTable() and SAPurgeAllCache() can be called to free the cache for specific queries, tables or databases, or all of it (see the Oracle documentation for details, and the sketch after this list). They should be called using the nQCmd utility, just after the ETL load and before the cache seed, to ensure there is no time-window gap or dependency.

3) Event polling table – a nice and robust concept, but not so nice to manage, and it leads to extra overhead. It is really a good technique for making the BI Server aware that a DB update has been done so it can carry on with its purging business. The cache polling frequency is an important setting and should be predefined and robust to make it a success. The poll table is populated by an auto-insert DB trigger each time a target DB table is updated; the Analytics server polls that table at set intervals and invalidates cache entries corresponding to the updated tables (see the sketch after this list).

4) Calling iBots to purge the cache – this can be done by calling a custom script, which in turn calls nQCmd and the ODBC extensions to free the cache. However, the catch in this feature is, again, that the iBots need to be scheduled just after the ETL run and before the cache seed, so you cannot be sure about stale data if the ETL doesn't meet its SLA. This too can be handled by chaining iBots to trigger the purging activity at the proper time. So the ground rule is: never rely on the iBot schedule alone; pick the status from the DB to trigger it.
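To make items 2 and 3 concrete, here is a hedged sketch; the DSN, credentials, schema and table names are all illustrative. Contents of purge.txt, run via nqcmd just after the ETL load and before the cache seed:

Call SAPurgeCacheByTable( 'orcl', '', 'SALES_SCHEMA', 'W_SALES_F' );

(or, as the blunt instrument, Call SAPurgeAllCache(); which takes no arguments)

nqcmd -d AnalyticsWeb -u weblogic -p Password01 -s /scripts/purge.txt -o /scripts/purge_run.log

And for the event polling approach, the DB trigger (or the ETL itself) inserts a row into the standard S_NQ_EPT polling table after a target table is reloaded; DATABASE_NAME must match the physical database name in the RPD:

INSERT INTO S_NQ_EPT (UPDATE_TYPE, UPDATE_TS, DATABASE_NAME, CATALOG_NAME, SCHEMA_NAME, TABLE_NAME, OTHER_RESERVED)
VALUES (1, SYSDATE, 'orcl', NULL, 'SALES_SCHEMA', 'W_SALES_F', NULL);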

Trade-off

Not purging outdated caches, known as stale caches, can potentially return inaccurate results over time. Think of a situation where your cache didn't get purged on time after the ETL load finished: though the database has been updated, the change is not reflected in your cached data, as the seeded cache holds outdated data on your filesystem. The result is stale data, which throws inconsistent and wrong results and can cause huge confusion in users' minds. Thus cache retention and refresh timing are important.

Not only that: on a large platform, caching should be separated and spread across multiple folders/mountpoints for better utilisation of I/O across the filesystems. The query cache storage should be on local, high-performance, highly reliable storage devices. The space consumed by the cache files can become a bottleneck for disk usage, so it should be managed with a proper cache replacement algorithm and the policy for the maximum number of cache entries defined in the NQSConfig.INI file.

A potential cache-related problem appears in clustered environments, where the cache files built on one host are not shared with the other; this leads to the same cache being built across cluster participants, as it cannot be ensured which BI Server will handle which request. Until a request is processed, a cluster participant cannot determine that there was already a cache hit for the generated query on another node, so the request still gets processed by the other clustered BI Server. Purging likewise needs to be done on both clustered servers. The cluster-aware cache is propagated across clustered servers only when the cache is seeded via iBots, not via ordinary dashboard requests or Answers-based queries.

So finally, if you understand your business and requirements, you can achieve success with cache management by availing yourself of any of the above techniques. But again, be cautious: caching is not such a reliable candidate, at least from my experience. I hope Oracle will surely look at making it robust and non-debatable. Wish you the best and good luck with your cache implementation 🙂 🙂