Monday, March 23, 2009

Parsing a 10046 trace with Python

Try the following...


import fileinput
import sys

b = 0               # are we inside a PARSING IN CURSOR section?
sqls = {}           # cursor number -> SQL text; a dict, so the size is dynamic
tok = ""
cur = 0
tim = 0
prevtim = 0
start = 0

for line in fileinput.input([sys.argv[1]]):
    if b == 1 and line.startswith("END OF STMT"):
        sqls[cur] = tok                  # the statement text is complete
        tok = ""
        b = 0
    elif line.startswith("PARSING IN CURSOR"):
        # e.g. "PARSING IN CURSOR #3 len=68 dep=0 ..." -- grab the cursor number
        cur = int(line.split(" ")[3].replace("#", ""))
        b = 1
    elif line.startswith("EXEC #"):
        c = int(line.split(":")[0].replace("EXEC #", ""))
        if tim > 0:                      # not the first EXEC in the trace
            prevtim = tim
            tim = int(line.split(",")[9].replace("tim=", ""))
            print str((tim - prevtim) / 1000) + " milliseconds since the previous statement."
            print str((tim - start) / 1000) + " milliseconds since the start of the trace."
        else:                            # first EXEC: remember the trace start time
            tim = int(line.split(",")[9].replace("tim=", ""))
            start = tim
        print "Executing " + sqls.get(c, "<cursor not parsed in this trace>")
    elif b == 1:
        tok = tok + line                 # accumulate the statement text
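Run it with the trace file as the only argument, e.g. (script and trace file names hypothetical):

python parse10046.py prod_ora_12345.trc

Note that on 10g the tim values are in microseconds, which is why the deltas are divided by 1000 to report milliseconds.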

Thursday, January 01, 2009

Getting connection strings from LDAP

DISCLAIMER: I apologize for the crummy code formatting :( I am going to be moving the content to my site within the next couple of months, as I can no longer deal with it on blogspot.

After a recent migration, we found deficiencies in the process by which clients connect. As such, we are trying to standardize our repository of connection strings.

We implemented an OpenLDAP server and added the schema information for Oracle objects from OID (Oracle Internet Directory). We were able to access the strings, but found ourselves concerned with failures. We didn't want to implement a high-availability naming service at this point, as our goal is standardization rather than speed and availability.

We ended up implementing a process that grabs the string prior to application startup. Previously, we had been hardcoding the URL in each application's properties file, all of which had to be changed whenever the database location changed.

We implemented a simple awk script to grab the URL and pass it to the application via a -Durl= command line property.


URL=$(ldapsearch -x -h linux5 -b "cn=OracleContext,dc=appcrawler,dc=com" cn=tst10g | grep -i orclnetdescstring | awk -F = '{{ for (i = 2; i < NF; i++) printf $i"=" }print $NF}')
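
This drops the URL into the shell variable URL, and the application can then be launched with it passed in as a system property, along these lines (the main class name is hypothetical):

java -Durl="$URL" com.mycompany.Main

Inside the application, System.getProperty("url") retrieves the string.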


If you have concerns with this method for whatever reason, the following can be used to query LDAP from within Java itself.


import java.util.Hashtable;
import javax.naming.*;
import javax.naming.directory.*;

public class getOraURL {
    String url;

    getOraURL() throws Exception {
        Hashtable env = new Hashtable(11);
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        DirContext dctx = new InitialDirContext(env);

        // Only fetch the attribute that holds the connect descriptor.
        SearchControls sc = new SearchControls();
        sc.setReturningAttributes(new String[] { "orclNetDescString" });

        String filter = "(cn=tst10g)";
        NamingEnumeration result = dctx.search(
                "ldap://192.168.1.60/cn=OracleContext,dc=appcrawler,dc=com", filter, sc);

        while (result.hasMore()) {
            SearchResult match = (SearchResult) result.next();
            Attributes attrs = match.getAttributes();
            NamingEnumeration e = attrs.getAll();
            while (e.hasMoreElements()) {
                Attribute attr = (Attribute) e.nextElement();
                for (int i = 0; i < attr.size(); i++) {
                    url = attr.get(i).toString();
                }
            }
        }
    }
}



The url variable that is set in the class above can be returned to the application to use as it sees fit.
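For example, a caller in the same package could grab it with something like this (a minimal sketch; exception handling omitted):

String jdbcUrl = "jdbc:oracle:thin:@" + new getOraURL().url;

Since orclNetDescString holds a full connect descriptor, prefixing it with jdbc:oracle:thin:@ should yield a usable thin-driver URL.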

Saturday, November 22, 2008

APEX tabs

We have been documenting our systems flow using Oracle APEX. It is a really user-friendly, declarative programming environment. As an old 3GL guy, however, I find this difficult at times.

I can't tell you how many times I want to go into TextPad, or vi, and just start typing a JSP page for needed user-interface functionality. On this project, however, I decided to move into the 21st century and use something more...ahem...productive.

In all fairness, it is a good tool, and in general very productive. However, the tab stuff is difficult to a mind-boggling degree. If you have problems, I direct you to the single best resource I found on the web for a dummy-proof guide to tabs on your APEX pages.

http://www.instantaccess.nl/wornmouse/Tabs-with-Oracle-Apex.pdf

This guy did a great job with it.

Friday, July 04, 2008

Who is generating all the redo?

Recently I was tasked with answering why we generated over 500GB of redo in a little over one day, when we normally generate an average of 100GB. We are in the midst of building out a remote data center which will house a RAC protected by Data Guard. Our initial bandwidth for sending redo from the production database is severely limited due to project mismanagement, so for a few months we are stuck with a link that allows a maximum of 400GB of throughput daily.

Since 500GB is more than 400GB, we needed to understand who had generated all of this redo, and whether it might recur. Of course, we found that this had happened over two months ago! I immediately mentioned that we would have to utilize the Oracle AWR to tell us who and/or what had done this. Thankfully, we retain 90 days of AWR with half-hour snapshots.

Confidently, I strode back to my desk and began to crack through the DBA_HIST views. To my surprise, I found no statistic that would tell me who was plowing through our redo logs like a creature in the movie "Tremors".

I did find a way to roughly identify this information, however, that I hope will be beneficial to you. I first checked DBA_HIST_SQLSTAT for any statement that had a far greater number of executions for the period in question than it did on average.

I found none.

I then looked into the dba_hist_seg_stat view, and found the PHYSICAL_WRITES_DELTA column. Ah hah! I had my answer. To my dismay, I found that this didn’t have the information, either.

After mulling it over for another hour or so, I went back to the dba_hist_seg_stat view. It was then I noticed the db_block_changes_delta column. I then identified the segment that experienced the most changes during the period, using what is below…

select to_char(begin_interval_time,'YYYY_MM_DD HH24:MI') snap_time,
dhsso.object_name,
sum(db_block_changes_delta)
from dba_hist_seg_stat dhss,
dba_hist_seg_stat_obj dhsso,
dba_hist_snapshot dhs
where dhs.snap_id = dhss.snap_id
and dhs.instance_number = dhss.instance_number
and dhss.obj# = dhsso.obj#
and dhss.dataobj# = dhsso.dataobj#
and begin_interval_time
between to_date('2008_05_02 12','YYYY_MM_DD HH24')
and to_date('2008_05_04 12','YYYY_MM_DD HH24')
group by to_char(begin_interval_time,'YYYY_MM_DD HH24:MI'),
dhsso.object_name
order by 3 desc

The answer immediately jumped out at me. It was a table that I recognized, but I didn’t know why it would bubble to the top of the list.

To confirm what I thought, I then ran the following query to examine the changes during the period to what they were historically.

select to_char(begin_interval_time,'YYYY_MM_DD HH24:MI') snap_time,
sum(db_block_changes_delta)
from dba_hist_seg_stat dhss,
dba_hist_seg_stat_obj dhsso,
dba_hist_snapshot dhs
where dhs.snap_id = dhss.snap_id
and dhs.instance_number = dhss.instance_number
and dhss.obj# = dhsso.obj#
and dhss.dataobj# = dhsso.dataobj#
and dhsso.object_name = 'OT_PENDING'
group by to_char(begin_interval_time,'YYYY_MM_DD HH24:MI')
order by 1

I found that 90% of the block changes over a three-month period *all* occurred in the day-and-a-half window I was examining. Now we were really on to something!

I gathered all SQL statements that were run during the period that referenced the table, using what is below…

select to_char(begin_interval_time,'YYYY_MM_DD HH24:MI'),
dbms_lob.substr(sql_text,4000,1),
dhss.instance_number,
dhss.sql_id,executions_delta,rows_processed_delta
from dba_hist_sqlstat dhss,
dba_hist_snapshot dhs,
dba_hist_sqltext dhst
where upper(dhst.sql_text) like '%OT_PENDING%'
and dhss.snap_id=dhs.snap_id
and dhss.instance_number=dhs.instance_number
and dhss.sql_id = dhst.sql_id
and begin_interval_time
between to_date('2008_05_02 12','YYYY_MM_DD HH24')
and to_date('2008_05_04 12','YYYY_MM_DD HH24')

I found a statement that was being executed an average number of times, relatively speaking. However, it was updating millions of rows, whereas it normally would update hundreds, if that.

I took the SQL_ID for the statement and searched through DBA_HIST_ACTIVE_SESS_HISTORY, using what is below...


select instance_number, to_char(sample_time,'yyyy_mm_dd hh24:mi:ss'),
user_id,
program
from dba_hist_active_sess_history
where sql_id = 'bxjz5d4p6x02u'
and snap_id between 43212 and 43220

I found that a SQL*Plus session from a development server was executing the statement, and the execution times exactly mirrored our jump in redo generation. Now I won’t even go into the lack of process in this shop and how often developers directly access this database, as that is a topic for a different post.

However, the information for which we were searching had been found!

Wednesday, February 20, 2008

Below is the output of a test case for Fast Connection Failover. Maybe it is just me, but I can never seem to find "soup to nuts" examples of this type of thing.

The runOracleConnMgrFailover java class does the following (let me know if you want the source code posted):

  1. Create a pool of 30 cached connections (OracleDataSource), with FCF enabled (see the configuration sketch after this list). These are connected to a two-node RAC, linux1 and linux4.
  2. Create a threaded onsSubscriber object that will post to standard output all events received from ONS.
  3. Sleep for five minutes, during which time we bounce one of the instances.
    1. At this time, we should have the connections equally distributed (I verified this but didn't show it) between linux1 and linux4.
  4. After the five minutes, 20 connections are created from the pool, and the instance on which they live is shown.
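
Below is a minimal sketch (not the actual test harness) of how such a pool can be configured with the 10g Release 2 JDBC driver. The host names and pool size match the test above; the service name, credentials, and ONS port are illustrative assumptions.

import java.util.Properties;
import javax.sql.DataSource;
import oracle.jdbc.pool.OracleDataSource;

public class FcfPoolSketch {
    public static DataSource create() throws Exception {
        OracleDataSource ods = new OracleDataSource();
        ods.setURL("jdbc:oracle:thin:@(DESCRIPTION=" +
                "(ADDRESS_LIST=(LOAD_BALANCE=ON)" +
                "(ADDRESS=(PROTOCOL=TCP)(HOST=linux1)(PORT=1521))" +
                "(ADDRESS=(PROTOCOL=TCP)(HOST=linux4)(PORT=1521)))" +
                "(CONNECT_DATA=(SERVICE_NAME=dummy)))");
        ods.setUser("scott");                        // illustrative credentials
        ods.setPassword("tiger");
        ods.setConnectionCachingEnabled(true);       // implicit connection cache
        ods.setFastConnectionFailoverEnabled(true);  // subscribe to ONS down events
        ods.setONSConfiguration("nodes=linux1:6200,linux4:6200"); // remote ONS daemons
        Properties cache = new Properties();
        cache.setProperty("InitialLimit", "30");     // pre-create the 30 connections
        cache.setProperty("MaxLimit", "30");
        ods.setConnectionCacheProperties(cache);
        return ods;
    }
}

Note that FCF must be enabled before the cache is first used, and the ONS daemons on the cluster must accept remote connections from the client machine.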

I chose five minutes because I wrote a simple socket server that got messages from ONS on the RAC. It appears as though the ONS software will post clients (the onsSubscriber class Oracle provides for testing ignores these messages) every two minutes (actually slightly less, about 1:58).

This gave me time to shut down one of the instances in under two minutes and see if the cached connections reconnected to the surviving instance (as Oracle claims will happen).

While testing this, I noticed that ONS posts the clients when an instance dies; not immediately, but close to it!

As such, I changed five minutes to 30 seconds, and I found the connections were all on the surviving node :)

Note that the sessions were "stateless"; in other words, they weren't doing anything when failed over. As a matter of fact, they were not "owned" by any "application" connection request until I began asking for 20 connections five minutes later.


linux3:oracle:dummy:/home/oracle>java runOracleConnMgrFailover 20
created connection manager, you can shutdown...
ONC subscriber starting
** HA event received -- Printing header:
Notification Type: database/event/service
Affected Components: null
Affected Nodes: null
Delivery Time: 1203528172772
Generating Component: database/rac/service
Generating Node: linux1
Generating Process: linux1:16358
Notification ID: linux1:1635803
Notification Creation Time:1203524376
Cluster ID: databaseClusterId
Cluster Name: databaseClusterName
Instance ID: databaseInstanceId
Instance Name: databaseInstanceName
Local Only Flag: FALSE
Cluster Only Flag: FALSE
Body: [B@10da5eb
** Body length = 97
** Event type = database/event/service
Sleep and retry.
ONC subscriber starting
ONC subscriber starting
** HA event received -- Printing header:
Notification Type: database/event/service
Affected Components: null
Affected Nodes: null
Delivery Time: 1203528174069
Generating Component: database/rac/service
Generating Node: linux1
Generating Process: linux1:16358
Notification ID: linux1:1635804
Notification Creation Time:1203524378
Cluster ID: databaseClusterId
Cluster Name: databaseClusterName
Instance ID: databaseInstanceId
Instance Name: databaseInstanceName
Local Only Flag: FALSE
Cluster Only Flag: FALSE
Body: [B@e45076
** Body length = 103
** Event type = database/event/service
Sleep and retry.
ONC subscriber starting
ONC subscriber starting
** HA event received -- Printing header:
Notification Type: database/event/service
Affected Components: null
Affected Nodes: null
Delivery Time: 1203528175107
Generating Component: database/rac/service
Generating Node: linux1
Generating Process: linux1:30339
Notification ID: linux1:3033901
Notification Creation Time:1203524379
Cluster ID: databaseClusterId
Cluster Name: databaseClusterName
Instance ID: databaseInstanceId
Instance Name: databaseInstanceName
Local Only Flag: FALSE
Cluster Only Flag: FALSE
Body: [B@15e9756
** Body length = 113
** Event type = database/event/service
Sleep and retry.
ONC subscriber starting
SID 147 started on linux4
SID 135 started on linux4
SID 152 started on linux4
SID 126 started on linux4
SID 120 started on linux4
SID 115 started on linux4
SID 133 started on linux4
SID 149 started on linux4
SID 128 started on linux4
SID 118 started on linux4
SID 122 started on linux4
SID 148 started on linux4
SID 124 started on linux4
SID 119 started on linux4
SID 132 started on linux4
SID 125 started on linux4
SID 131 started on linux4
SID 117 started on linux4
SID 121 started on linux4
SID 113 started on linux4

Wednesday, January 02, 2008

RAC connection load balancing finally explained!

I have admittedly struggled to completely understand how RAC load balances connections. As I think I finally understand it, I thought I would post my findings.

Normally, when you find a resource that describes server-side load balancing, it simply gives you a URL to use, or suggests you use the OracleDataSource with connection pooling "implicitly" enabled. I never understood how it works, though.

You *do* use the OracleDataSource with ConnectionCachingEnabled set to true. Using this configuration does not, at *physical* connection time, pick the least loaded node (which is the assumption under which I labored for months). You simply provide the java class (or C#, etc.) with a URL that is configured with LOAD_BALANCE = TRUE, and it will round-robin an approximately equal number of connections to each node. The load balancing comes *after* that.
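
For the thin driver, such a URL simply embeds the connect descriptor; something along these lines (hosts and service name are illustrative):

jdbc:oracle:thin:@(DESCRIPTION=
  (ADDRESS_LIST=
    (LOAD_BALANCE=ON)
    (ADDRESS=(PROTOCOL=TCP)(HOST=linux1)(PORT=1521))
    (ADDRESS=(PROTOCOL=TCP)(HOST=linux4)(PORT=1521)))
  (CONNECT_DATA=(SERVICE_NAME=myservice)))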

Let's say you create a pool with 100 connections. A URL that asks for load balancing will then create about 50 physical connections to each node. Your application then begins requesting connections as needed from this pool of physical connections, frequently connecting and disconnecting.

Let's then say that a process (or processes) that consumes large amounts of CPU begins running on the first node. As new application connection requests come in, the connection manager will choose from its pool of connections based on information it has received from the ONS (Oracle Notification Service) processes on each node in the cluster. Since ONS has posted the connection manager in the java application that node 2 is less loaded, you should see fewer newly created "logical" connections using the first, more heavily utilized node.

If you look in GV$SESSION at that point, you will still see an approximately equal number of connections to each node. However, the "logical" connections "lent" out to application requests are taken from the second node, as it can presumably provide a better response time (if that is the service goal).

Friday, December 14, 2007

Does correlation help at all?

While studying how we can use statistics to better understand the performance of our applications, I came upon the concept of kurtosis. What this essentially means is that any given distribution is not normal if its kurtosis is very high...OK, that sounds like a fatal disease, what does it mean?!!!

A normal distribution just means, basically, that the vast majority of the values taken are "pretty close" to the average. This can obviously vary greatly. For example, if the average of 50 samples is 25, and the range is 20-30, it is probably normally distributed. However, if the average is 15, and 95% of the samples are less than 15, it is probably not normally distributed. How can this happen? Well, in the case of Oracle's AWR, if most of our samples show about 500 seconds of "db file sequential read" time, and two show 15,000 seconds, our distribution is not "normal". As a result, the standard deviation and correlation calculations we have been using to predict performance may fly out the window of validity.
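
To make that concrete, the sample excess kurtosis is m4 / m2^2 - 3, where m2 and m4 are the second and fourth central moments of the sample; a normal distribution scores roughly 0, while heavy-tailed data scores well above it. Here is a minimal sketch (the wait-time figures are made up to mirror the example above):

public class Kurtosis {
    public static double excessKurtosis(double[] x) {
        double mean = 0;
        for (double v : x) mean += v;
        mean /= x.length;
        double m2 = 0, m4 = 0;           // second and fourth central moments
        for (double v : x) {
            double d = v - mean;
            m2 += d * d;
            m4 += d * d * d * d;
        }
        m2 /= x.length;
        m4 /= x.length;
        return m4 / (m2 * m2) - 3;       // subtract 3 so a normal scores ~0
    }

    public static void main(String[] args) {
        double[] waits = new double[20];
        for (int i = 0; i < 18; i++) waits[i] = 490 + i;   // 18 "normal" waits near 500
        waits[18] = 15000;                                 // two extreme outliers
        waits[19] = 15000;
        System.out.println(excessKurtosis(waits));         // well above 0 => not normal
    }
}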

Kurtosis (and skewness) is what provides us with insight as to how we can test the normality of our data prior to drawing conclusions.

More to come...