Friday, December 14, 2007

Does correlation help at all?

While studying how we can use statistics to better understand the performance of our applications, I came upon the concept of kurtosis. In essence, a distribution with very high kurtosis is not normal...OK, that sounds like a fatal disease, but what does it mean?!!!

A normal distribution just means, basically, that the vast majority of the values are "pretty close" to the average. The specifics can obviously vary greatly. For example, if the average of 50 samples is 25 and the range is 20-30, the data is probably normally distributed. However, if the average is 15 and 95% of the samples are less than 15, it is probably not normally distributed. How can this happen? Well, in the case of Oracle's AWR, if most of our samples show 500 seconds waited on "db file sequential read" in the snapshot window, and two show 15,000 seconds, our distribution is not "normal". As a result, the standard deviation and correlation calculations we have been using to predict performance may fly out the window of validity.

Kurtosis (and skewness) is what gives us insight into how to test the normality of our data before drawing conclusions.
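As a teaser, here is roughly how excess kurtosis could be computed in straight SQL. This is just a sketch, and the wait_samples table is hypothetical (imagine the per-snapshot wait times above loaded into it); a result near 0 is consistent with normal tails, while those two 15,000 second outliers would push it way up:

-- sketch: excess kurtosis of our samples; wait_samples is a
-- hypothetical staging table of per-snapshot wait times
select avg(power((w.time_waited - s.m) / s.sd, 4)) - 3 excess_kurtosis
  from wait_samples w,
       (select avg(time_waited) m, stddev(time_waited) sd
          from wait_samples) s;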

More to come...

Friday, November 23, 2007

Multicollinearity

In regression, the predictors should not be correlated with one another. In stepwise selection, one variable may absorb the predictive power of another, so the second variable never makes it into the model. Correlated predictors can also end up with inconsistent coefficients (negative when the actual effect is positive).

If we use wait events as predictors, there will in many cases be multicollinearity. In other words, if we see a lot of waits on "db file sequential read", we can expect to also see significant waits on "db file scattered read".

Using v$sysstat makes this even more of a problem. For example, "user I/O wait time" will almost certainly be correlated with "physical reads".

As such, we need to programmatically identify those predictors that are not correlated with each other.
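Below is a rough sketch of the idea against AWR. Fair warning: it correlates the raw (cumulative) time_waited_micro values, so a production version should correlate the deltas between snapshots instead, and the 0.6 cutoff is just a rule of thumb:

-- sketch: flag wait event pairs whose snapshot values move together
select a.event_name event1,
       b.event_name event2,
       round(corr(a.time_waited_micro, b.time_waited_micro), 2) r
  from dba_hist_system_event a,
       dba_hist_system_event b
 where a.snap_id = b.snap_id
   and a.event_name < b.event_name
 group by a.event_name, b.event_name
having abs(corr(a.time_waited_micro, b.time_waited_micro)) > 0.6
 order by 3 desc;

Any pair this returns is a candidate for dropping one of its members from the predictor list.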

Tuesday, November 20, 2007

C callouts from mySQL

While I was trying to find a way to use the statistical functions built into mySQL for performance measurement, I discovered the ability to create user defined functions that use C callouts from within the database. I knew you could create your own functions using SQL inside the database, but this allows us to expose OS functionality/information if we need it.



This is pretty cool, because we can extend the functionality of mySQL (subject to limitations) without changing the source code.



Below is a very simple example of getting the hostname on which the server runs. This functionality already exists in mySQL ("show variables like '%host%'"), but as noted this is just a simple example. There may be other ways we could use this technology, such as sending mail from within the database…



[root@linux3 lib]# cat foo.c
/* mysql.h declares the UDF_* types and my_bool;
   unistd.h declares gethostname() */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <sys/types.h>
#include <unistd.h>
#include <mysql.h>

typedef unsigned long long ulonglong;
typedef long long longlong;

#ifdef __cplusplus
extern "C" {
#endif

/* An init function is required and a deinit function is optional.
   These can be used to set up before the function call and to clean
   up afterwards. */

my_bool my_machine_name_init(UDF_INIT *initid,
UDF_ARGS *args,
char *message){
return 0;
}

char* my_machine_name(UDF_INIT *initid,
UDF_ARGS *args,
char* result,
unsigned long* length,
char *is_null,
char *error){
/* mySQL preallocates a 255-byte result buffer, so don't let
   gethostname() write past it */
gethostname(result,255);
*length = strlen(result);
return result;
}


We then compile the C code above into a shared library in /usr/lib, so mySQL is already aware of it (that directory is in the LD_LIBRARY_PATH, but it could be anywhere mySQL "knows" about). Note that on 64-bit platforms, gcc may also require the -fPIC flag…


[root@linux3 lib]# gcc -shared -o foo.so foo.c


…and then create a function in the database that references this library.


mysql> drop function my_machine_name;
Query OK, 0 rows affected (0.00 sec)
mysql> create function my_machine_name returns string soname 'foo.so';
Query OK, 0 rows affected (0.00 sec)
mysql> select my_machine_name();
+--------------------------+
| my_machine_name() |
+--------------------------+
| linux3 |
+--------------------------+
1 row in set (0.01 sec)

mysql> exit
Bye
-bash-3.00$


Like I said, what's above is pretty simple, but I am excited about the possibilities. The code has to be thread safe and well tested, but it does open some doors for us if the need arises.

Saturday, November 10, 2007

A recent question on CDOS asked how to end up with only one space at the end of a CSV output row. Below was my response...

SQL> create table t1110(c varchar2(10))
2 /

Table created.

SQL> insert into t1110 values('1')
2 /

1 row created.

SQL> insert into t1110 values('2 ')
2 /

1 row created.

SQL> insert into t1110 values('3 new ')
2 /

1 row created.

SQL> column len format 9
SQL> column orig format a10
SQL> column newone format a10
SQL> column new_len format 9
SQL> set trims off
SQL> select length(c) len,
2 c orig,
3 regexp_replace(c, '\W+$', ' ') newone,
4 length(regexp_replace(c, '\W+$', ' ')) new_len
5 from t1110
6 /

LEN ORIG NEWONE NEW_LEN
--- ---------- ---------- -------
1 1 1 1
5 2 2 2
7 3 new 3 new 6

SQL>
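One caveat on the above: \W matches any non-word character, so a field ending in punctuation would have that punctuation collapsed into the space as well. If only trailing whitespace should be replaced, a POSIX character class is the safer pattern (an untested variation on the example above):

select regexp_replace(c, '[[:space:]]+$', ' ') newone
  from t1110
/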

Friday, November 02, 2007

Control file writes when updating???

I had never seen this before, but while testing a migration strategy for a large table with XMLTYPEs, almost half our wait time was spent on "control file sequential read" and "control file parallel write" waits. Each of these occurred immediately after a "direct path write" wait.

When I dug a little further, I found that we had the table set to NOLOGGING. We had not set the LOBs to NOCACHE or NOLOGGING. However, evidently the table-level setting trumped, and the unrecoverable changes caused each update to write the corresponding SCN to the control file.

We set an event (10359) that eliminated the control file writes for our session, and improved our performance by about 40% (27 minutes for our test case down to 16)!!!
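For the record, the event was set with the usual event syntax, along these lines (with the standard caveat: check with Oracle Support before setting undocumented events on anything you care about):

alter session set events '10359 trace name context forever, level 1';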

Monday, October 22, 2007

Connection pool example

I was really surprised there were no working examples (even simple ones) in the Oracle documentation. Below is a simple connection pool (with cached statements that eliminate soft parsing)...


import java.sql.*;
import oracle.jdbc.pool.*;

class oracleConnMgr {
OracleDataSource ods;
oracleConnMgr() {
try {
ods = new OracleDataSource();
// connection cache sizing: start with 3 connections,
// never fewer than 3, never more than 5
java.util.Properties prop = new java.util.Properties();
prop.setProperty("InitialLimit", "3");
prop.setProperty("MinLimit", "3");
prop.setProperty("MaxLimit", "5");
// each pooled connection keeps up to 20 implicitly cached statements
prop.setProperty("MaxStatementsLimit", "20");
String url = "jdbc:oracle:thin:@wcasbuild01.prod.oclc.org:2484/wcasbld.prod.oclc.org";
// implicit statement caching is what eliminates the soft parses
ods.setImplicitCachingEnabled(true);
ods.setURL(url);
ods.setUser("howards");
ods.setPassword("oclc*5439");
ods.setConnectionCachingEnabled(true);
ods.setConnectionCacheProperties(prop);
}
catch (Exception e) {
e.printStackTrace();
}
}

public static void main(String args[]) {
try {
oracleConnMgr cm = new oracleConnMgr();
for (int i = 1; i <= 30; i++) {

Connection conn = cm.ods.getConnection();
ResultSet rst1= conn.createStatement().executeQuery("select sys_context('userenv','sid') from dual");
String sid = "";
while (rst1.next()) {
sid = rst1.getString(1);
}
PreparedStatement pst = conn.prepareStatement("select * from t");
ResultSet rst = pst.executeQuery();
rst.close();
pst.close();
ResultSet rset= conn.createStatement().executeQuery("select parse_calls,executions from v$sql where sql_id = '89km4qj1thh13'");
while (rset.next()) {
System.out.println("SID " + sid + " running, statement has been parsed " + rset.getString(1) + " times and executed " + rset.getString(2) + " times.");
}
conn.close();
}
}
catch (Exception e) {
e.printStackTrace();
}
}
}

Sunday, October 14, 2007

Below is a simple set of classes for performing your own statistical analysis. As noted, they are simple (no multiple regression analysis, etc.)


class myStats {
double standardDeviation (double[] array) {
double[] arr = new double[array.length];
double av = average(array);
for (int i = 0; i < array.length; i++) {
arr[i] = (array[i] - av) * (array[i] - av);
}
double d = 0;
for (int i = 0; i < array.length; i++) {
d = d + arr[i];
}
return Math.sqrt(d / (arr.length - 1));
}

double correlation (double[] array1, double[] array2) {
double[] arr = new double[array1.length];
double av1 = average(array1);
double av2 = average(array2);
for (int i = 0; i < array1.length; i++) {
arr[i] = (array1[i] - av1) * (array2[i] - av2);
}
double d = 0;
for (int i = 0; i < array1.length; i++) {
d = d + arr[i];
}
double sd1 = standardDeviation(array1);
double sd2 = standardDeviation(array2);
return (d / (sd1 * sd2)) / (array1.length - 1);
}

double slope (double[] array1, double[] array2) {
// slope of the least squares line predicting array1 (y) from
// array2 (x): r * (sd of y / sd of x)
double sd1 = standardDeviation(array1);
double sd2 = standardDeviation(array2);
return (sd1 / sd2) * correlation(array1,array2);
}

double yIntercept(double array1[], double array2[]) {
double d = 0;
double sl = slope(array1, array2);
double yAvg = average(array1);
double xAvg = average(array2);
d = yAvg - (sl * xAvg);
return d;
}

double average(double array[]) {
double d = 0;
double total = 0;
for (int i = 0; i < array.length; i++) {
total = total + array[i];
}
d = total / array.length;
return d;
}

double lineFit(double array1[], double array2[]) {
double yAvg = average(array1);
double sl = slope(array1, array2);
double yi = yIntercept(array1, array2);
double d1 = 0;
double squaredResidual = 0;
double squaredYVariance = 0;
for (int i = 0; i < array1.length; i++) {
squaredYVariance = squaredYVariance + ((array1[i] - yAvg) * (array1[i] - yAvg));
d1 = ((array2[i] * sl) + yi);
squaredResidual = squaredResidual + ((array1[i] - d1) * (array1[i] - d1));
}
return 1 - (squaredResidual / squaredYVariance);
}
}

class runMyStats {
public static void main(String args[]) {
double[] i2 = {5,10,20,8,4,6,12,15};
double[] i = {27,46,73,40,30,28,46,59};
//double[] i2 = {3,6,9,12};
//double[] i = {15,30,45,60};
myStats m = new myStats();
System.out.println("AVG = " + m.average(i));
System.out.println("STDEV = " + m.standardDeviation(i));
System.out.println("Correlation = " + m.correlation(i,i2));
System.out.println("Slope = " + m.slope(i,i2));
System.out.println("YIntercept = " + m.yIntercept(i,i2));
System.out.println("lineFit = " + m.lineFit(i,i2));
}
}

Thursday, October 11, 2007

Limitations of wait event analysis?

OK, so we have come full circle: we know that hit ratio tuning doesn't work (for everything), and now wait event analysis has been deemed to have limitations...but what are they? I sense a *great* interview question coming on for a senior DBA position.

Quite frankly, there is no silver bullet in tuning. All that matters is how your application is performing. Can you risk walking down the hall to the restroom without a user "accidentally" tripping you on the way? Having said that, I do think wait event analysis is an excellent place to start. Humans measure most things by time..."the system is really slow today...", so the wait event interface is a great place to at least see what is, well...waiting :)

Here is an example of where it is useful. User A comes into the DBA's cube and says, "We installed a new version of the application software this weekend, and now the system is crawling."

The DBA immediately goes to his handy grid control interface, and runs ADDM for the last 30 minutes. He immediately sees that I/O is consuming nearly 60% of user time (db file scattered read). He looks at the top SQL, and sees there is an index that needs to be created. Voila, problem solved.

Now let's look at a different example. While being proactive one Thursday afternoon, the DBA reviewed the top waits of the past 60 days in his database. "What's this?! We have quadrupled our waits on "direct path read" in the last 30 days. What happened?!"

Since his restroom trips have proven largely uneventful in terms of being tripped, he checks other statistics. He finds a corresponding increase in "physical reads direct (lob)" over the same time period.

He finally drills down into the historical SQL detail, and finds a new statement that began running 30 days ago (right after a software install). He compares the time the statement spends performing I/O, and it is almost perfectly correlated with his increase in direct path read waits. He finds it does an EXISTSNODE call on a cached XMLTYPE column. Instant answer! But where is the problem? There may or may not be one.

Stay tuned for more later...

Here's a cool little function to provide the statistical correlation between two wait events. This helped us identify performance drivers in our database. As a rule of thumb, anything above 60 (the function scales the correlation coefficient by 100, so this means r > 0.6) indicates the two events are correlated: if one goes up, so does the other.

create or replace function corr_events (p_event1 in varchar2,
                                        p_event2 in varchar2)
return number is
  l_num number;
begin
  -- time_waited_micro is cumulative, so subtracting the windowed min
  -- (the previous row) turns each snapshot into a delta
  select round((corr(val1, val2) * 100), 0)
    into l_num
    from (select snap_id,
                 time_waited_micro - min(time_waited_micro)
                   over (order by snap_id
                         rows between 1 preceding and current row) val1
            from dba_hist_system_event
           where event_name = p_event1) a,
         (select snap_id,
                 time_waited_micro - min(time_waited_micro)
                   over (order by snap_id
                         rows between 1 preceding and current row) val2
            from dba_hist_system_event
           where event_name = p_event2) b
   where a.snap_id = b.snap_id;
  return l_num;
end;
/
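Usage is then a simple select. For example, to check the suspicion from the multicollinearity post above:

select corr_events('db file sequential read',
                   'db file scattered read') corr_pct
  from dual;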

Thursday, August 09, 2007

OK, so I was rebuilding a 15 billion row table (looong story, mostly being rebuilt due to poor initial design). At any rate, some of the data was "bad" (dupes we didn't know we had). I found out after the fact that 10g has a "log errors into" clause for all kinds of DML statements. Cool! I just wish I had known *before* I started the rebuild ...

See below for a test case...

SQL> create table t0808(c number primary key);

Table created.

SQL> begin
2 for i in 1..5 loop
3 insert into t0808 values(i);
4 end loop;
5 end;
6 /

PL/SQL procedure successfully completed.

SQL> commit;

Commit complete.

SQL> exec dbms_errlog.create_error_log('T0808');

PL/SQL procedure successfully completed.

SQL> insert into t0808 select rownum from dba_users
2 log errors into err$_t0808 reject limit unlimited;

17 rows created.

SQL> commit;

Commit complete.

SQL> select * from t0808;

C
----------
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22

22 rows selected.

SQL> COLUMN ora_err_mesg$ FORMAT A50
SQL> SELECT ora_err_mesg$ from err$_t0808;

ORA_ERR_MESG$
----------------------------------------------------------------------
ORA-00001: unique constraint (REP.SYS_C005175) violated
ORA-00001: unique constraint (REP.SYS_C005175) violated
ORA-00001: unique constraint (REP.SYS_C005175) violated
ORA-00001: unique constraint (REP.SYS_C005175) violated
ORA-00001: unique constraint (REP.SYS_C005175) violated

5 rows selected.

SQL>


Pretty cool, eh?

Tuesday, July 31, 2007

Read only tablespace requires no outstanding transactions...anywhere...

Nice little gotcha I received yesterday morning. Our intention is to rebuild two large tables from range partitioning to hash partitioning on another server. One of the datafiles for the tablespace that houses one of the tables in question is 1.4TB in size. When we tried to put that tablespace in read only mode (we are using transportable tablespaces to segment the rebuild on another box), it hung on the "enq: TX - contention" wait event. I found (and never knew before) that putting *any* tablespace into read only mode requires that no outstanding (not yet committed or rolled back) transactions exist anywhere in the database, even in tablespaces you are not putting into read only mode. Unfortunately, a long running transaction was about six hours from completing, so we had to wait it out.

It wasn't enormously inconvenient, but it caused our rebuild to be pushed back a bit. I wonder why *no* transactions can exist? Does the kernel not "know" which transactions could affect the tablespace in question? Maybe it's just too complex to track, and since it's such a rare occurrence, the kernel designers just said "to be safe, let's ensure there is nothing happening when we go into read only mode"?
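For next time, a quick look at v$transaction joined to v$session would at least tell us up front who we are waiting for, before issuing the alter tablespace ... read only:

select s.sid, s.serial#, s.username, t.start_time
  from v$transaction t, v$session s
 where t.ses_addr = s.saddr;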

Tuesday, July 03, 2007

Do we need AWR?

A recent post on CDOS got me to thinking...do we need AWR? After all, AWR stands for "Automated Workload Repository". There are three words in that functionality title.

From dictionary.com:

Automated (automatic, actually): having the capability of starting, operating, moving, etc., independently
Workload: The amount of work that a machine produces or can produce in a specified time period.
Repository: A place where things may be put for safekeeping.

The normal reasons given for using AWR (or not) are as follows:

  1. You can't duplicate the functionality
  2. It is against your license terms to use AWR if you haven't paid for it.

Depending on which of these (if either) is true, you may or may not want to use it.

The first one indicates you can't do what it does. To answer this accurately, we have to document what it does. To be honest, I can't list every single piece of functionality, but I can say what I use it for (which is all that matters, anyway). I never use Grid Control to view AWR information; I have written my own queries against the AWR (we *do* have it licensed). Historical trending is a piece of functionality I have always found lacking: if I want to see the historical change in "db file sequential read", there is no Grid tool (of which I am aware) that will do this for me. As a result, I wrote my own query to which I pass an event, and it prints out the historical time waited on that event. You can of course (and I do) apply the same approach to OSSTAT, SYSSTAT, etc. In order to do this, of course, I must query the AWR views...or do I?

I could write my own SQL that would create tables as copies of each v$ view, and then periodically insert the current contents of those views into the repository tables. This is a simple solution. In fact, I wrote it this morning in 45 minutes.
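To give a flavor of it, here is a minimal sketch for one view (the my_hist_system_event name is made up; the real version covers each v$ view of interest from a scheduled job):

-- one-time setup: an empty copy of v$system_event plus a capture time
create table my_hist_system_event as
select sysdate snap_time, e.*
  from v$system_event e
 where 1 = 0;

-- run periodically (e.g., from dbms_job) to snapshot the current contents
insert into my_hist_system_event
select sysdate, e.*
  from v$system_event e;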

But wait, is this legal? In other words, is it the historical AWR views that cannot be used, or is it the live v$ view information that cannot be queried?

According to http://download.oracle.com/docs/cd/B19306_01/license.102/b14199.pdf, the only live v$ view that is off limits is v$active_session_history; the historical DBA_HIST_ views, on the other hand, are all part of the pack.

Command-Line APIs
Diagnostics Pack features can also be accessed by way of database server APIs and
command-line interfaces:
  • The DBMS_WORKLOAD_REPOSITORY package is part of this pack.
  • The DBMS_ADVISOR package is part of this pack if you specify ADDM as the value of the advisor_name parameter, or if you specify for the value of the task_name parameter any value starting with the ADDM prefix.
  • The V$ACTIVE_SESSION_HISTORY dynamic performance view is part of this pack.
  • All data dictionary views beginning with the prefix DBA_HIST_ are part of this pack, along with their underlying tables.
  • All data dictionary views with the prefix DBA_ADVISOR_ are part of this pack if queries to these views return rows with the value ADDM in the ADVISOR_NAME column or a value of ADDM* in the TASK_NAME column or the corresponding TASK_ID.
  • The following reports found in the /rdbms/admin/ directory of the Oracle home directory are part of this pack: awrrpt.sql, awrrpti.sql, addmrpt.sql, addmrpti.sql, ashrpt.sql, ashrpti.sql, awrddrpt.sql, awrddrpi.sql, awrsqrpi.sql, awrsqrpt.sql.

Thursday, June 07, 2007

Can you shutdown an instance from plsql?

SQL> begin
2 execute immediate 'alter database close';
3 execute immediate 'alter database dismount';
4 end;
5 /

PL/SQL procedure successfully completed.

SQL>

Shutting down archive processes
Archiving is disabled
Tue May 29 16:41:10 2007
ARCH shutting down
ARC1: Archival stopped
Tue May 29 16:41:15 2007
ARCH shutting down
ARC0: Archival stopped
Tue May 29 16:41:16 2007
Thread 1 closed at log sequence 24
Successful close of redo thread 1
Tue May 29 16:42:07 2007
Completed: alter database close
Tue May 29 16:42:07 2007
alter database dismount
Tue May 29 16:42:08 2007
SUCCESS: diskgroup ASM1 was dismounted
Tue May 29 16:42:09 2007
SUCCESS: diskgroup ASM2 was dismounted
SUCCESS: diskgroup ASM3 was dismounted
Tue May 29 16:42:09 2007
Completed: alter database dismount

It doesn't look like you can actually shut down the background processes, though, without external code.

Also, you can't re-mount a database dismounted from PL/SQL, as you will get an

ORA-00750: database has been previously mounted and dismounted.

Thursday, May 31, 2007

Do we need CDOS?

The usenet newsgroup comp.databases.oracle.server ("CDOS") can be a contentious place. Posters there tend to be very opinionated, and sometimes (actually, frequently in my estimation) very rude. It occurred to me that most countries in the world desire to be free. As a result, "big brother" type message board moderators don't tend to go over well with computer types, who are normally free thinkers. However, I have seen good contributor after good contributor driven away from CDOS by what I frequently call the "self appointed Oracle police". There are a few very prominent posters on the newsgroup who consistently become, for lack of a better word, angry with "newbies" for asking questions they (the "police") don't feel should be asked.

This creates (at least) two problems in my mind:
  1. Newbies are reluctant to ask questions for fear they may not be posting "correctly" (don't we all have enough to worry about in our lives without complaining about whether someone "top posts"?), or that they are asking questions for which they should already know the answer, because "they are too lazy to read the documentation", etc.
  2. Good posters (good, technical contributors to questions posted for discussion) are driven away by the rudeness of these posters (I can't tell you how many times I have seen this over the years...)
It occurred to me that if we don't want to be policed by a "big brother" moderator, and if we don't want to spend time in the jungle of the self appointed Oracle police, there is an answer.

Why not have a forum where readers vote posters off the board in a democratic fashion? Although this may sound somewhat inhumane on the surface, it makes sense to me. Users who don't want to confront the Oracle playground bullies could anonymously cast a vote which would contribute to the deactivation of the bully's account(s). This could be done by restricting IP, username, etc., with a voter only being able to cast one vote against any given "problem child". This would allow the group to moderate itself, giving the "unheard" voice a say in what the group looks like. Granted, I haven't completely thought this out (a once-a-week purge?), but it's a start...

Of course, this idea relies on the ability (or lack thereof) of Oracle aficionados (or just newbies looking for an answer to a one time "simple" question) to act like human beings, and not "have it in for" a given poster.

Are there legal issues with this? In other words, what if a "scorned poster" posts things from a different IP, etc., which get the forum owner in legal hot water? Would this idea itself drive away good contributors (the stooges are normally very sound, technically)?

Having said all of that, it sounds like a cool app to write, anyhoo!!!

Wednesday, May 02, 2007

Standby created without a standby controlfile?!!

Today, a junior DBA created a standby database by copying the controlfile to another host, restoring the 4TB database it serves, and recovering all available logs via RMAN. Sounds good, right? Well, it was until we went into SQL*Plus and found that a "RECOVER STANDBY DATABASE;" command returned an error complaining that we were not using a standby controlfile. At this point, we were thinking that we had lost an entire day, when we were already behind on the project.

My initial thought was to issue an "alter database convert to physical standby;". However, I then found that the DBA had created the controlfile for the standby as a "normal" backup controlfile. Was this a good thing?...yeah, not so much. Oracle will not allow you to convert a backup controlfile to a standby controlfile, since a backup controlfile implies you are doing incomplete recovery, which a standby database, by definition, should not be doing.

To complicate matters, we were putting the datafiles on the standby in ASM, where they had been on raw devices on the primary host. As a result, RMAN "set newname" commands were required.

After copying a "real" standby controlfile to the standby host, I found the set newname commands did not work, as this controlfile did not "know" about the files we had previously restored and renamed into ASM on the standby. Since we obviously did not want to re-restore 4TB of datafiles, we ended up cataloging the datafile copies on the standby host so the new standby controlfile "knew" about them, i.e.,

catalog datafilecopy '+ASM_TIER2/XWCBMW/DATAFILE/AEID_LB1_4K.285.621280543' level 0;

...

This worked, and gave us much joy, as our set newname commands in RMAN now worked and allowed recovery to commence :)