Saturday, November 22, 2008
APEX tabs
I can't tell you how many times I want to go into TextPad, or vi, and just start typing a jsp page for needed user interface functionality. On this project, however, I decided to move into the 21st century and use something more...ahem...productive.
In all fairness, it is a good tool, and in general very productive. However, the tab stuff is difficult to a mind boggling degree. If you have problems, I direct you to the single best resource I found on the web for a dummy proof guide to tabs on your APEX pages.
http://www.instantaccess.nl/wornmouse/Tabs-with-Oracle-Apex.pdf
This guy did a great job with it.
Friday, July 04, 2008
Who is generating all the redo?
Since 500GB are more than 400GB, we needed to understand who had generated all of this redo, and whether it may reoccur. Of course, we found that this was done over two months ago! I immediately mentioned that we would have to utilize the Oracle AWR to tell us who and/or what had done this. Thankfully, we retain 90 days of AWR with half hour snapshots.
Confidently, I strode back to my desk and begin to crack through the DBA_HIST views. To my surprise, I found no statistic that would tell me who was plowing through our redo logs like a creature in the movie "Tremors".
I did find a way to roughly identify this information, however, that I hope will be beneficial to you. I first checked DBA_HIST_SQL_STAT for any statement that had a far greater number of executions for the period in question, than they did on average.
I found none.
I then looked into the dba_hist_seg_stat view, and found the PHYSICAL_WRITES_DELTA column. Ah hah! I had my answer. To my dismay, I found that this didn’t have the information, either.
After mulling around for another hour or so, I went back to the dba_hist_seg_stat view. It was then I noticed the db_block_changes_delta column. I then identified the segment that experienced the most changes during the period, using what is below…
select to_char(begin_interval_time,'YYYY_MM_DD HH24:MI') snap_time,
dhsso.object_name,
sum(db_block_changes_delta)
from dba_hist_seg_stat dhss,
dba_hist_seg_stat_obj dhsso,
dba_hist_snapshot dhs
where dhs.snap_id = dhss.snap_id
and dhs.instance_number = dhss.instance_number
and dhss.obj# = dhsso.obj#
and dhss.dataobj# = dhsso.dataobj#
and begin_interval_time
between to_date('2008_05_02 12','YYYY_MM_DD HH24')
and to_date('2008_05_04 12','YYYY_MM_DD HH24')
group by to_char(begin_interval_time,'YYYY_MM_DD HH24:MI')
The answer immediately jumped out at me. It was a table that I recognized, but I didn’t know why it would bubble to the top of the list.
select to_char(begin_interval_time,'YYYY_MM_DD HH24:MI') snap_time,
sum(db_block_changes_delta)
from dba_hist_seg_stat dhss,
dba_hist_seg_stat_obj dhsso,
dba_hist_snapshot dhs
where dhs.snap_id = dhss.snap_id
and dhs.instance_number = dhss.instance_number
and dhss.obj# = dhsso.obj#
and dhss.dataobj# = dhsso.dataobj#
and dhsso.object_name = 'OT_PENDING'
group by to_char(begin_interval_time,'YYYY_MM_DD HH24:MI')
I found that 90% of the block changes over a three month period *all* occurred in the day and half window I was examining. Now we were really on to something!
I gathered all SQL statements that were run during the period that referenced the table, using what is below…
select to_char(begin_interval_time,'YYYY_MM_DD HH24:MI'),
dbms_lob.substr(sql_text,4000,1),
dhss.instance_number,
dhss.sql_id,executions_delta,rows_processed_delta
from dba_hist_sqlstat dhss,
dba_hist_snapshot dhs,
dba_hist_sqltext dhst
where upper(dhst.sql_text) like '%OT_PENDING%'
and dhss.snap_id=dhs.snap_id
and dhss.instance_Number=dhs.instance_number
and dhss.sql_id = dhst.sql_id
I found a statement that was being executed an average number of times, relatively speaking. However, it was updating millions of rows, whereas it normally would update hundreds, if that.
I took the SQL_ID for the statement and searched through DBA_HIST_ACTIVE_SESS_HISTORY, using what is below...
select instance_number, to_char(sample_time,'yyyy_mm_dd hh24:mi:ss'),
user_id,
program
from dba_hist_active_sess_history
where sql_id = 'bxjz5d4p6x02u'
and snap_id between 43212 and 43220
I found that a SQL*PLUS session from a development server was executing the statement, and the execution times exactly mirrored our jump in redo generation. Now I won’t even go into the lack of processes in this shop and how often developers directly access this database, as that is a topic for a different post.
However, the information for which we were searching had been found!
Wednesday, February 20, 2008
The runOracleConnMgrFailover java class does the following (let me know if you want the source code posted):
- Create a pool of 30 cached connections (OracleDataSource), with FCF enabled. These are connected to a two node RAC, linux1 and linux4.
- Create a threaded onsSubscriber objects that will post to standard output all events received from ONS.
- Sleep for five minutes, during which time we bounce one of the instances.
- At this time, we should have the connections equally distributed (I verified this but didn't show it) between linux1 and linux4.
- After the five minutes, 20 connections are created from the pool, and the instance on which they live is shown.
I chose five minutes because I wrote a simple socket server that got messages from ONS on the RAC. It appears as though the ONS software will post clients (the onsSubscriber class Oracle provides for testing ignores these messages) every two minutes (actually slightly less, about 1:58).
This gave me time to shutdown one of the instances in under two minutes and see if the cached connections reconnected to the surviving instance (as Oracle claims will happen).
While testing this, it occurred to me that ONS posts the clients when an instance dies, while not immediately, close to it!
As such, I changed five minutes to 30 seconds, and I found the connections were all on the surviving node :)
Note that the session were "stateless", in other words, they weren't doing anything when failed over. As a matter of fact, they were not "owned" by any "application" connection request until I began asking for 20 connections five minutes later.
linux3:oracle:dummy:/home/oracle>java runOracleConnMgrFailover 20
created connection manager, you can shutdown...
ONC subscriber starting
** HA event received -- Printing header:
Notification Type: database/event/service
Affected Components: null
Affected Nodes: null
Delivery Time: 1203528172772
Generating Component: database/rac/service
Generating Node: linux1
Generating Process: linux1:16358
Notification ID: linux1:1635803
Notification Creation Time:1203524376
Cluster ID: databaseClusterId
Cluster Name: databaseClusterName
Instance ID: databaseInstanceId
Instance Name: databaseInstanceName
Local Only Flag: FALSE
Cluster Only Flag: FALSE
Body: [B@10da5eb
** Body length = 97
** Event type = database/event/service
Sleep and retry.
ONC subscriber starting
ONC subscriber starting
** HA event received -- Printing header:
Notification Type: database/event/service
Affected Components: null
Affected Nodes: null
Delivery Time: 1203528174069
Generating Component: database/rac/service
Generating Node: linux1
Generating Process: linux1:16358
Notification ID: linux1:1635804
Notification Creation Time:1203524378
Cluster ID: databaseClusterId
Cluster Name: databaseClusterName
Instance ID: databaseInstanceId
Instance Name: databaseInstanceName
Local Only Flag: FALSE
Cluster Only Flag: FALSE
Body: [B@e45076
** Body length = 103
** Event type = database/event/service
Sleep and retry.
ONC subscriber starting
ONC subscriber starting
** HA event received -- Printing header:
Notification Type: database/event/service
Affected Components: null
Affected Nodes: null
Delivery Time: 1203528175107
Generating Component: database/rac/service
Generating Node: linux1
Generating Process: linux1:30339
Notification ID: linux1:3033901
Notification Creation Time:1203524379
Cluster ID: databaseClusterId
Cluster Name: databaseClusterName
Instance ID: databaseInstanceId
Instance Name: databaseInstanceName
Local Only Flag: FALSE
Cluster Only Flag: FALSE
Body: [B@15e9756
** Body length = 113
** Event type = database/event/service
Sleep and retry.
ONC subscriber starting
SID 147 started on linux4
SID 135 started on linux4
SID 152 started on linux4
SID 126 started on linux4
SID 120 started on linux4
SID 115 started on linux4
SID 133 started on linux4
SID 149 started on linux4
SID 128 started on linux4
SID 118 started on linux4
SID 122 started on linux4
SID 148 started on linux4
SID 124 started on linux4
SID 119 started on linux4
SID 132 started on linux4
SID 125 started on linux4
SID 131 started on linux4
SID 117 started on linux4
SID 121 started on linux4
SID 113 started on linux4
Wednesday, January 02, 2008
RAC connection load balancing finally explained!
Normally, when you find a resource that describes server side load balancing, it simply gives you a URL to use, or suggests you use the OracleDataSource with connection pooling "implicitly" enabled. I never understood how it works, though.
You *do* use the OracleDataSource with ConnectionCachingEnabled set to true. Using this configuration does not, at *physical* connection time, use what is the least loaded node (which is the assumption under which I labored for months). You simply provide the java class (or C#, etc.) with a URL that is configured with LOAD_BALANCE = TRUE, and it will round robin to each node an approximately equal number of connections. The load balancing comes *after* that.
Let's say you create a pool with 100 connections. The URL that asks for LOAD BALANCING will then create about 50 physical connections to each node. Your application then begins requesting a connection as needed from this pool of physical connections, frequently connecting and disconnecting.
Let's then say that a process/processes that consume(s) large amounts of CPU begins running on the first node. As new application connections requests come in, the connection manager will choose from its pool of connections based on information it has received from the ONS (Oracle Notification Services) processes on each node in the cluster. Since ONS has posted the connection manager in the java application that node 2 is less loaded, you should see less newly created "logical" connections using the first more heavily utilized node.
If you look in GV$SESSION at that point, you will still see an approximately equal number of connections to each node. However, the "logical" connections "lent" to application requests for such are taken from the second node, as it can presumably provide a better response time (if that is the service goal).