Keep clause
Even though these functions were introduced back in version 9, I've seen lots of code that could have used them but didn't. That's a pity, because it's a wasted opportunity to write shorter and faster code. The common use case I'm talking about is a detail table with a validity period: typically a startdate column and, optionally, an enddate. For such a table, you often need the values of the currently valid row. An example: suppose we have a table RELATIONS, and for each relation we want to know its address at a certain point in time:
Relation "Oracle Nederland" has two addresses, its current address
being at the Hertogswetering. And, fictitiously, relation "Ciber Nederland"
has four addresses. The current address is the Den Bosch one, and I've
also recorded a future address in Nieuwegein. Note that, in real life,
the latter three are all Ciber offices currently in use.
To get the active relation addresses on October 1st, 2012, I can use
this query:
But what if I want to retrieve the current address belonging to these
rows? In fact, this question is asked frequently in Oracle forums. Prior to
Oracle8, you would have used a query like the one below:
This uses a correlated subquery, accessing table RELATION_ADDRESSES (or
its index) twice. That double access can be avoided from Oracle8
onwards by using an analytic function:
Here you compute a row_number, partitioning the result set per
relation_id and ordering by startdate in descending order. This means
the row with the most recent startdate before the reference date gets
row_number 1 within its relation_id. By using an inline view, we can
filter on the outcome of the analytic function and select only the rows
with row_number 1. In forums, you'll often see this solution advised.
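Outside the database, the effect of this inline view is easy to mimic. Here is a minimal Python sketch (illustrative only, not Oracle code) of what partitioning by relation_id, ordering by startdate descending, and keeping row_number 1 computes:

```python
from itertools import groupby
from operator import itemgetter

def current_rows(rows, reference_date):
    # Keep only rows starting on or before the reference date, then take
    # the row with the highest startdate per relation_id -- the row that
    # would get row_number 1 when ordering by startdate descending.
    eligible = [r for r in rows if r["startdate"] <= reference_date]
    eligible.sort(key=itemgetter("relation_id", "startdate"))
    return [list(grp)[-1]
            for _, grp in groupby(eligible, key=itemgetter("relation_id"))]

addresses = [
    {"relation_id": 1, "startdate": "1995-01-01", "city": "De Meern"},
    {"relation_id": 1, "startdate": "2011-01-01", "city": "Utrecht"},
    {"relation_id": 2, "startdate": "2010-01-01", "city": "Den Bosch"},
    {"relation_id": 2, "startdate": "2015-01-01", "city": "Nieuwegein"},
]
print(current_rows(addresses, "2012-10-01"))
# the 2011 Utrecht row and the 2010 Den Bosch row remain
```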
Compared to the correlated subquery, this query selects from table
RELATION_ADDRESSES only once. However, you can do even better by just
adding three "keep clause" functions to the original query:
The three extra aggregate functions all do a "dense_rank last order by
startdate", meaning: sort the rows by startdate, and pick only those
rows with the most recent startdate. If more rows share that startdate,
the max function at the start tells Oracle to pick the maximum
address/postal_code/city among them. However, (relation_id,startdate)
is unique, so ties are impossible and the max function is just a dummy;
I could equally have used min.
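To make the tie-breaking explicit: the semantics of max(...) keep (dense_rank last order by ...) within one group can be sketched in Python like this (an illustration of the semantics, not the Oracle implementation):

```python
def keep_dense_rank_last(rows, order_key, value_key):
    # Restrict the group to the rows sharing the greatest order_key
    # value (the "dense_rank last" part), then apply the leading
    # aggregate -- here max -- over the surviving values.
    last = max(r[order_key] for r in rows)
    return max(r[value_key] for r in rows if r[order_key] == last)

group = [
    {"startdate": "2000-01-01", "city": "Eindhoven"},
    {"startdate": "2010-01-01", "city": "Den Bosch"},
]
print(keep_dense_rank_last(group, "startdate", "city"))  # Den Bosch

# With a tie on startdate, max breaks the tie alphabetically:
tied = group + [{"startdate": "2010-01-01", "city": "Utrecht"}]
print(keep_dense_rank_last(tied, "startdate", "city"))   # Utrecht
```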
The query is shorter and, to me, clearer at first glance. But the main reason for my enthusiasm for the aggregate functions FIRST and LAST is that they are simply faster. To show this, let's execute those queries against a table with 300,000 rows: 100,000 relations with 3 addresses each:
Note that I created histograms with 254 buckets, just to make the
optimizer see that it should full scan the table despite the "startdate
<= :REFERENCE_DATE" predicate.
This next query should give a clue about what's in the table:
So a couple of rows are filtered out because they lie in the future,
but for most relations the latest row is the current one.
This is the plan of the first query with the correlated subquery:
A HASH JOIN RIGHT ANTI for the not exists, and a total of .66 seconds.
Next, the plan for the query with the analytic row_number function:
Note that this query takes longer than the correlated subquery above:
.97 seconds versus .66 seconds. The HASH JOIN ANTI took .49 seconds
(.66 - .05 - .12), whereas computing the ROW_NUMBER took .84 seconds
(.93 - .09). So here, on my laptop, I avoided .05 seconds for the INDEX
FAST FULL SCAN, but spent .35 seconds (.84 - .49) more on the
computation. When I/O is more expensive than on my laptop, the time of
the first query will likely go up and the two timings will be closer to
each other.
Now the keep clause variant:
The shortest query, the shortest plan and the fastest execution. The
SORT GROUP BY immediately reduces the number of intermediate rows from
297K to 100K, whereas the WINDOW SORT PUSHED RANK had to compute the
row_number for all 297K rows.
You can see a great and more detailed explanation of connect by with filtering here on Christian Antognini's blog.
When I was researching the new recursive subquery factoring clause a year and a half ago, comparing a standard hierarchical query on EMP using recursive subquery factoring with one using the good old connect by, I stumbled upon a new optimizer algorithm for implementing recursive queries:
You might wonder what I did to make two identical queries use different execution plans; I'll address that later. First, I'd like to show that there are two optimizer hints available with which you can control which algorithm the optimizer uses:
And this surprised me. As the version column suggests, the no_connect_by_filtering hint and the accompanying new algorithm were already introduced in version 10.2.0.2! I checked with my old 10.2.0.4 database, and it is indeed present and usable there:
But you need the no_connect_by_filtering hint in version 10.2.0.4 for this query. If you do not provide the hint, this is the result:
Which explains why I hadn't seen the CONNECT BY NO FILTERING WITH START-WITH earlier. It seems Oracle adjusted the cost calculation of connect by queries somewhere between 10.2.0.4 and 11.2.0.1. Just look at the cost in both execution plans on 10.2.0.4, using a regular explain plan statement and a "select * from table(dbms_xplan.display)":
The cost of 3 is due to the full table scan of EMP, and no additional cost is added for the hierarchical query.
These are the plans from 11.2.0.2:
The numbers from 11.2.0.2 show more sophistication than just the cost of the table scan. The optimizer can't know how many levels deep the data goes, but version 10.2.0.4 apparently assumed one level and left the total cost unchanged at 3. I'm curious in which version between 10.2.0.4 and 11.2.0.2 this cost calculation changed. If anyone reading this has a version in between and would like to check, please let me know in the comments. My guess is that 11.2.0.1 introduced the cost change.
What does CONNECT BY NO FILTERING WITH START-WITH do?
Let's explore this, using this table:
The data is tree shaped, where each parent node has exactly 9 child nodes. One tenth of the data, the rows whose id ends with the digit 3, has its indicator column set to 'N'. This select query makes clearer what the data looks like:
When I hear the word "filter", I almost immediately associate it with a WHERE clause. But a where clause in a connect by query is not what is meant by connect by filtering. The documentation states:
So a where clause predicate is evaluated AFTER the connect by has done its job. You can see that happening here:
The "indicator = 'N'" predicate is at step 1, which is executed after the CONNECT BY WITH FILTERING at step 2. Note that although this query is executed in 11.2.0.2, the optimizer has chosen the old CONNECT BY WITH FILTERING.
Connect by filtering is done by using filters in your CONNECT BY clause. Here is an example using the predicate "indicator = 'N'" inside the CONNECT BY clause:
In the A-Rows column, you can see that connect by filtering was effective here: only the necessary rows were read. And this is the key difference between the two connect by algorithms: with CONNECT BY WITH FILTERING, you can filter within each recursion step, whereas CONNECT BY NO FILTERING WITH START-WITH has to read everything, perform an in-memory operation, and return the result. With this example, the latter is much less efficient:
100K rows were read, and the A-Time was 0.14 seconds instead of 0.01 seconds. I wondered where those 0.14 seconds went, since the plan shows it's NOT the full table scan. Using Tom Kyte's runstats_pkg reveals this:
The major difference is the number of rows sorted! The CONNECT BY NO FILTERING WITH START-WITH sorts all 100K rows. This is a surprise, because normally when you sort, you use memory from the PGA workarea, which shows up in the memory statistics (OMem, 1Mem, Used-Mem) of your execution plan. But the no filtering plan did not show those statistics. I have no explanation for this phenomenon yet.
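The difference between the two algorithms can be caricatured in Python. Below, both functions return the same hierarchy, but the filtering variant prunes branches during the recursion, while the other variant first reads every node and only then evaluates the predicate in memory (a sketch of the idea, not of Oracle's actual implementation):

```python
def connect_by_with_filtering(children, root, keep, visited):
    # Apply the CONNECT BY predicate at every recursion step:
    # a rejected child is never visited, nor are its descendants.
    result = {root}
    stack = [root]
    while stack:
        node = stack.pop()
        visited.append(node)
        for child in children.get(node, []):
            if keep(child):
                result.add(child)
                stack.append(child)
    return result

def connect_by_no_filtering(children, root, keep, visited):
    # Read the entire hierarchy first ...
    stack = [root]
    while stack:
        node = stack.pop()
        visited.append(node)
        stack.extend(children.get(node, []))
    # ... then evaluate the predicate in memory: a node survives
    # only if its parent survived and the predicate accepts it.
    result = {root}
    for node in visited:
        if node in result:
            result.update(c for c in children.get(node, []) if keep(c))
    return result

tree = {1: [2, 3], 2: [4, 5], 3: [6, 7]}
seen_f, seen_nf = [], []
a = connect_by_with_filtering(tree, 1, lambda n: n != 3, seen_f)
b = connect_by_no_filtering(tree, 1, lambda n: n != 3, seen_nf)
print(a == b, len(seen_f), len(seen_nf))  # same rows, 4 vs 7 nodes visited
```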
Let's zoom in on the sorting:
So CONNECT BY WITH FILTERING did 8 sorts (2286 - 2278) and sorted 12 rows (9425522 - 9425510), whereas CONNECT BY NO FILTERING WITH START-WITH did 2 (2288 - 2286) sorts and sorted 100,001 rows (9525523 - 9425522).
And finally, I promised to explain why the first two queries of this blogpost are identical yet show different execution plans. The reason is simple: the first one was executed on 10.2.0.4 and the second one on 11.2.0.2.
You may have seen an aggregate function like this in SQL queries:

max(value) keep (dense_rank first order by mydate)

or this analytic variant:

max(value) keep (dense_rank last order by mydate) over (partition by relation_nr)

Unfortunately, when you start searching for the "keep" clause, you won't find it under that name in the Oracle documentation (and hopefully, because of this blogpost, people will now have a reference). Of course Oracle documents these functions; you only have to know that they are called FIRST and LAST in the SQL Language Reference.
An example: suppose we have a table RELATIONS, and for each relation we want to know its address at a certain point in time:
SQL> create table relations
2  ( id number not null primary key
3  , name varchar2(30) not null
4  )
5  /

Table created.

SQL> insert into relations
2  select 1, 'Oracle Nederland' from dual union all
3  select 2, 'Ciber Nederland' from dual
4  /

2 rows created.

SQL> create table relation_addresses
2  ( relation_id number not null
3  , startdate date not null
4  , address varchar2(30) not null
5  , postal_code varchar2(6) not null
6  , city varchar2(30) not null
7  , constraint ra_pk primary key (relation_id,startdate)
8  , constraint ra_r_fk foreign key (relation_id) references relations(id)
9  )
10  /

Table created.

SQL> insert into relation_addresses
2  select 1, date '1995-01-01', 'Rijnzathe 6', '3454PV', 'De Meern' from dual union all
3  select 1, date '2011-01-01', 'Hertogswetering 163-167', '3543AS', 'Utrecht' from dual union all
4  select 2, date '2000-01-01', 'Frankrijkstraat 128', '5622AH', 'Eindhoven' from dual union all
5  select 2, date '2006-01-01', 'Meerkollaan 15', '5613BS', 'Eindhoven' from dual union all
6  select 2, date '2010-01-01', 'Burgemeester Burgerslaan 40b', '5245NH', 'Den Bosch' from dual union all
7  select 2, date '2015-01-01', 'Archimedesbaan 16', '3439ME', 'Nieuwegein' from dual
8  /

6 rows created.

SQL> begin
2  dbms_stats.gather_table_stats( user, 'relations' );
3  dbms_stats.gather_table_stats( user, 'relation_addresses' );
4  end;
5  /

PL/SQL procedure successfully completed.
SQL> var REFERENCE_DATE varchar2(10)
SQL> exec :REFERENCE_DATE := '2012-10-01'

PL/SQL procedure successfully completed.

SQL> select ra.relation_id
2  , max(ra.startdate) startdate
3  from relation_addresses ra
4  where ra.startdate <= to_date(:REFERENCE_DATE, 'yyyy-mm-dd')
5  group by ra.relation_id
6  /

RELATION_ID STARTDATE
----------- -------------------
          1 01-01-2011 00:00:00
          2 01-01-2010 00:00:00

2 rows selected.
SQL> select ra.relation_id
2  , ra.startdate
3  , ra.address
4  , ra.postal_code
5  , ra.city
6  from relation_addresses ra
7  where ra.startdate <= to_date(:REFERENCE_DATE, 'yyyy-mm-dd')
8  and not exists
9  ( select 'a relation_address with a more recent startdate'
10  from relation_addresses ra2
11  where ra2.relation_id = ra.relation_id
12  and ra2.startdate <= to_date(:REFERENCE_DATE, 'yyyy-mm-dd')
13  and ra2.startdate > ra.startdate
14  )
15  /

RELATION_ID STARTDATE           ADDRESS                        POSTAL CITY
----------- ------------------- ------------------------------ ------ ------------------------------
          1 01-01-2011 00:00:00 Hertogswetering 163-167        3543AS Utrecht
          2 01-01-2010 00:00:00 Burgemeester Burgerslaan 40b   5245NH Den Bosch

2 rows selected.
SQL> select relation_id
2  , startdate
3  , address
4  , postal_code
5  , city
6  from ( select ra.relation_id
7  , ra.startdate
8  , ra.address
9  , ra.postal_code
10  , ra.city
11  , row_number() over (partition by ra.relation_id order by ra.startdate desc) rn
12  from relation_addresses ra
13  where ra.startdate <= to_date(:REFERENCE_DATE, 'yyyy-mm-dd')
14  )
15  where rn = 1
16  /

RELATION_ID STARTDATE           ADDRESS                        POSTAL CITY
----------- ------------------- ------------------------------ ------ ------------------------------
          1 01-01-2011 00:00:00 Hertogswetering 163-167        3543AS Utrecht
          2 01-01-2010 00:00:00 Burgemeester Burgerslaan 40b   5245NH Den Bosch

2 rows selected.
SQL> select ra.relation_id
2  , max(ra.startdate) startdate
3  , max(ra.address) keep (dense_rank last order by ra.startdate) address
4  , max(ra.postal_code) keep (dense_rank last order by ra.startdate) postal_code
5  , max(ra.city) keep (dense_rank last order by ra.startdate) city
6  from relation_addresses ra
7  where ra.startdate <= to_date(:REFERENCE_DATE, 'yyyy-mm-dd')
8  group by ra.relation_id
9  /

RELATION_ID STARTDATE           ADDRESS                        POSTAL CITY
----------- ------------------- ------------------------------ ------ ------------------------------
          1 01-01-2011 00:00:00 Hertogswetering 163-167        3543AS Utrecht
          2 01-01-2010 00:00:00 Burgemeester Burgerslaan 40b   5245NH Den Bosch

2 rows selected.
To show this, let's execute those queries against a table with 300,000 rows: 100,000 relations with 3 addresses each:
SQL> create table relations
2  ( id number not null primary key
3  , name varchar2(30) not null
4  )
5  /

Table created.

SQL> create table relation_addresses
2  ( relation_id number not null
3  , startdate date not null
4  , address varchar2(30) not null
5  , postal_code varchar2(6) not null
6  , city varchar2(30) not null
7  , constraint ra_pk primary key (relation_id,startdate)
8  , constraint ra_r_fk foreign key (relation_id) references relations(id)
9  )
10  /

Table created.

SQL> insert into relations
2  select level
3  , dbms_random.string('a',30)
4  from dual
5  connect by level <= 100000
6  /

100000 rows created.

SQL> insert into relation_addresses
2  select 1 + mod(level-1,100000)
3  , date '2013-01-01' - numtodsinterval(level, 'hour')
4  , dbms_random.string('a',30)
5  , dbms_random.string('a',6)
6  , dbms_random.string('a',30)
7  from dual
8  connect by level <= 300000
9  /

300000 rows created.

SQL> begin
2  dbms_stats.gather_table_stats
3  ( user
4  , 'relations'
5  , cascade => true
6  , method_opt => 'FOR ALL INDEXED COLUMNS SIZE 254'
7  , estimate_percent => 100
8  );
9  dbms_stats.gather_table_stats
10  ( user
11  , 'relation_addresses'
12  , cascade => true
13  , method_opt => 'FOR ALL INDEXED COLUMNS SIZE 254'
14  , estimate_percent => 100
15  );
16  end;
17  /

PL/SQL procedure successfully completed.
SQL> select *
2  from relation_addresses
3  where relation_id in (1,2,99999,100000)
4  order by relation_id
5  , startdate
6  /

RELATION_ID STARTDATE           ADDRESS                        POSTAL CITY
----------- ------------------- ------------------------------ ------ ------------------------------
          1 09-03-1990 15:00:00 tKgXePxuAIdhFBNJLIRRjodrlJzGOl vPIAbL pNkbFHTJPrVuDIYLxsCfUfetBsKJIE
          1 05-08-2001 07:00:00 LybVzfpzoQzXjpCAdkSZrkYrwUtZtL cWJwFe IczTRyjITWCJIOErccfITVvsqRVyMF
          1 31-12-2012 23:00:00 lNEwsdYhbwdqRxHTSCTCykgICxiXKL oXzHQF YfyKFmiboCWfmNLjVLZoKmUDoMFaDu
          2 09-03-1990 14:00:00 svOylQPkbyfympSXRMeyudfFErFvlO MLFdpG LTtAKdrpUmCwFgqEmoKxnUtWecwgcV
          2 05-08-2001 06:00:00 BsRCUviBiLHaAEjyRVnIedRAWzuVSe DlBlZW ErQmCkDgNDTMOdZzceFYrMXnZmmjxg
          2 31-12-2012 22:00:00 wqdFdXoBdmmCooLtGfWOMKukIMrDlI geRRHz DaPpWHOOdWgbjLaRkxfFDUIPgVgvEt
      99999 12-10-1978 01:00:00 FsXOjUdNIgjjGjnWpJjTTscbcuqsxa PdhVtm qOskmLwRlngSEihmlpYhmNHhvtrpBc
      99999 09-03-1990 17:00:00 sqoKYNeDntZtAUSmSDMtIQZloTSVeD uGPszi GIDctptEomcGzYGYhUGhKHgDRZJCmY
      99999 05-08-2001 09:00:00 fhHGwuGPIHSOaKdjDvDcqTzsbHZzqR tpaLAP rVYCmijzqJmhlnZZLXkHpgFmLAEiTS
     100000 12-10-1978 00:00:00 WwxfHcVfkFfItgcXfjPnKTiATlHjao nSOjSn vZNRsRySNPlmQKgCJjcpiEOhQIxzoy
     100000 09-03-1990 16:00:00 cGcVPMsFyxCBrnsZtMYBnaAflXiNff NVKRIr SseFWkWyUDgaPpbxdmENdLjurGbJPK
     100000 05-08-2001 08:00:00 dRfCmqdmbhcmaMvyYBpewPsFBCVdlG BMQWLY YPaAGnKKUkfdnAeAyLYeUBfXwezsEo

12 rows selected.
SQL> select * from table(dbms_xplan.display_cursor( null, null, 'iostats last' ))
2  /

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------------------------------------
SQL_ID  d6p5uh67h65yb, child number 0
-------------------------------------
select ra.relation_id , ra.startdate , ra.address ,
ra.postal_code , ra.city from relation_addresses ra where
ra.startdate <= to_date(:REFERENCE_DATE, 'yyyy-mm-dd' ) and not exists
( select 'a relation_address with a more recent startdate'
from relation_addresses ra2 where ra2.relation_id =
ra.relation_id and ra2.startdate <=
to_date(:REFERENCE_DATE, 'yyyy-mm-dd' ) and ra2.startdate >
ra.startdate )

Plan hash value: 3749094337

-----------------------------------------------------------------------------------------------------------------
| Id  | Operation             | Name               | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |                    |      1 |        |    100K|00:00:00.66 |   15071 |  3681 |
|*  1 |  HASH JOIN RIGHT ANTI |                    |      1 |   2978 |    100K|00:00:00.66 |   15071 |  3681 |
|*  2 |   INDEX FAST FULL SCAN| RA_PK              |      1 |    297K|    297K|00:00:00.05 |    1240 |    35 |
|*  3 |   TABLE ACCESS FULL   | RELATION_ADDRESSES |      1 |    297K|    297K|00:00:00.12 |   13831 |  3646 |
-----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("RA2"."RELATION_ID"="RA"."RELATION_ID")
       filter("RA2"."STARTDATE">"RA"."STARTDATE")
   2 - filter("RA2"."STARTDATE"<=TO_DATE(:REFERENCE_DATE,'yyyy-mm-dd'))
   3 - filter("RA"."STARTDATE"<=TO_DATE(:REFERENCE_DATE,'yyyy-mm-dd'))


30 rows selected.
SQL> select * from table(dbms_xplan.display_cursor( null, null, 'iostats last' ))
2  /

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------------------------------------
SQL_ID  1zd4wqtxkc2vz, child number 0
-------------------------------------
select relation_id , startdate , address , postal_code
, city from ( select ra.relation_id , ra.startdate
, ra.address , ra.postal_code ,
ra.city , row_number() over (partition by ra.relation_id
order by ra.startdate desc ) rn from relation_addresses ra
where ra.startdate <= to_date(:REFERENCE_DATE, 'yyyy-mm-dd' )
) where rn = 1

Plan hash value: 2795878473

--------------------------------------------------------------------------------------------------------------------
| Id  | Operation                | Name               | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads |
--------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT         |                    |      1 |        |    100K|00:00:00.97 |    7238 |  3646 |
|*  1 |  VIEW                    |                    |      1 |    297K|    100K|00:00:00.97 |    7238 |  3646 |
|*  2 |   WINDOW SORT PUSHED RANK|                    |      1 |    297K|    200K|00:00:00.93 |    7238 |  3646 |
|*  3 |    TABLE ACCESS FULL     | RELATION_ADDRESSES |      1 |    297K|    297K|00:00:00.09 |    7238 |  3646 |
--------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("RN"=1)
   2 - filter(ROW_NUMBER() OVER ( PARTITION BY "RA"."RELATION_ID" ORDER BY
       INTERNAL_FUNCTION("RA"."STARTDATE") DESC )<=1)
   3 - filter("RA"."STARTDATE"<=TO_DATE(:REFERENCE_DATE,'yyyy-mm-dd'))


29 rows selected.
SQL> select * from table(dbms_xplan.display_cursor( null, null, 'iostats last' ))
2  /

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------------------------------------
SQL_ID  dcw8tyyqtu2kk, child number 0
-------------------------------------
select ra.relation_id , max (ra.startdate) startdate ,
max (ra.address) keep (dense_rank last order by ra.startdate) address
, max (ra.postal_code) keep (dense_rank last order by ra.startdate)
postal_code , max (ra.city) keep (dense_rank last order by
ra.startdate) city from relation_addresses ra where ra.startdate <=
to_date(:REFERENCE_DATE, 'yyyy-mm-dd' ) group by ra.relation_id

Plan hash value: 2324030966

--------------------------------------------------------------------------------------------------------------
| Id  | Operation          | Name               | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads |
--------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |                    |      1 |        |    100K|00:00:00.55 |    7238 |  3646 |
|   1 |  SORT GROUP BY     |                    |      1 |    100K|    100K|00:00:00.55 |    7238 |  3646 |
|*  2 |   TABLE ACCESS FULL| RELATION_ADDRESSES |      1 |    297K|    297K|00:00:00.09 |    7238 |  3646 |
--------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("RA"."STARTDATE"<=TO_DATE(:REFERENCE_DATE,'yyyy-mm-dd'))


24 rows selected.
Much Ado About Nothing?
I was recently reading this presentation PDF by Hugh Darwen, called How To Handle Missing Information Without Using NULL.
Several great thinkers and founders of relational theory consider
NULL the thing that should not be. For example, one slide in the
above-mentioned PDF is titled SQL's Nulls Are A Disaster, and I found a paper with the amusing title The Final Null In The Coffin.
I can understand the critique. The introduction of NULL leads to three valued logic, which makes programs much more complex and harder to prove correct. All database professionals likely have been bitten by NULLs several times during their career, myself included. And a NULL can have several interpretations. By using NULL, you are not making clear what is meant. If the value for column hair_colour is NULL, does it mean the person is bald? Or do you know the person has hair, but you just don't know what colour? Or can the person be bald or have hair, but you just don't know which one applies? Or is the person in the midst of a hair colouring exercise and you only temporarily don't know the colour? If you're creative, I'm sure you can come up with other interpretations as well.
On the other hand, the theorists don't have to build database applications for end users who like reasonable response times, and I do. Avoiding nulls at all cost typically leads to a data model with more tables than needed, requiring more joins and therefore making queries slower. So I have to make a trade off. In general I try to avoid nullable columns as much as possible, for example by choosing subtype implementations instead of supertype implementations, and by modelling entity subtypes in the first place, but I will never let it noticeably slow down my application. At my current job, I'm making a data model right now. Having read all the use cases, I know how the data will be used, and so I know where in the model there is room to avoid an extra nullable column. One thing I'll never voluntarily do, though, is make up strange outlier values just to get rid of the null.
Anyway, I was curious to see how Hugh Darwen handles missing information without using nulls. His paper contains a concise example, which I'll translate to Oracle syntax in this blogpost to see what it practically takes to avoid nulls. He starts with this table:
Which contains four NULL values. The meaning of those NULL values can't
be seen from this table, but this is what they are meant to express:
Here we have achieved a data model from which every NULL has been banned.
Now what if we'd like to simulate a query against the PERS_INFO table? Darwen uses this expression to transform the seven tables back to the PERS_INFO table:
Translated to Oracle syntax, this becomes:
Very elaborate, but the optimizer does a great job at simplifying the
query under the covers, as can be seen in this execution plan:
If I had to build the PERS_INFO table back again with a query myself, I'd use this shorter query with six left outer joins:
Although, as you can see below, the plan doesn't really improve:
But the two plans above are really complex, compared with a simple query against the PERS_INFO table with nullable columns:
If queries like this are not very frequent in your database, you might
take this extra work for granted and avoid the NULL. But you need to
consider something else as well: the new schema requires many more
constraints. Using just the PERS_INFO table, a single primary key
constraint on the Id column is all you need. But for the new model,
Darwen describes 9, but really 15, constraints:
And so are constraints 6, 7 and 8:
But constraint 2 says that the Id in table CALLED is a foreign
distributed key, and constraints 3, 4 and 5 say that the Ids of tables
DOES_JOB, JOB_UNK and UNEMPLOYED form a distributed key. Oracle doesn't
have declarative support for distributed keys or foreign distributed
keys. We could write database trigger code to implement this, which is
very hard to do correctly, or we could use the materialized view trick
to have the condition validated at the end of the transaction instead
of at the end of the statement, which also has its downsides.
And such deferred constraint checking is explicitly ruled out by The
Third Manifesto as well. Nevertheless, here is how it can be done.
The distributed key (constraints 3, 4 and 5):
And to show that the distributed key implementation works:
And the foreign distributed key ("Every row in CALLED has a matching row
in either DOES_JOB, JOB_UNK, or UNEMPLOYED.") can be implemented like
this:
And some proof that this implementation works:
Would I go through the extra trouble of an implementation with 6 more
tables, 14 extra constraints and the worse performance shown above? It
depends: on how often the data is queried, on how often it is updated
concurrently, on whether the distinction between the possible meanings
of NULL is relevant in my case, and on whether I have enough extra time
to implement it. Using Oracle, most often I probably won't.
He starts with this table:
SQL> select *
2  from pers_info
3  /

        ID NAME       JOB            SALARY
---------- ---------- ---------- ----------
      1234 Anne       Lawyer         100000
      1235 Boris      Banker
      1236 Cindy                      70000
      1237 Davinder

4 rows selected.
- Boris earns something, but we don't know how much
- Cindy does some job, but we don't know what it is
- Davinder doesn't have a job
- Davinder doesn't have a salary
SQL> select *
2  from called
3  /

        ID NAME
---------- --------
      1234 Anne
      1235 Boris
      1236 Cindy
      1237 Davinder

4 rows selected.

SQL> select *
2  from does_job
3  /

        ID JOB
---------- ------
      1234 Lawyer
      1235 Banker

2 rows selected.

SQL> select *
2  from job_unk
3  /

        ID
----------
      1236

1 row selected.

SQL> select *
2  from unemployed
3  /

        ID
----------
      1237

1 row selected.

SQL> select *
2  from earns
3  /

        ID     SALARY
---------- ----------
      1234     100000
      1236      70000

2 rows selected.

SQL> select *
2  from salary_unk
3  /

        ID
----------
      1235

1 row selected.

SQL> select *
2  from unsalaried
3  /

        ID
----------
      1237

1 row selected.
Now what if we'd like to simulate a query against the PERS_INFO table? Darwen uses this expression to transform the seven tables back to the PERS_INFO table:
01 | WITH (EXTEND JOB_UNK ADD ‘Job unknown’ AS Job_info) AS T1, |
02 | (EXTEND UNEMPLOYED ADD ‘Unemployed’ AS Job_info) AS T2, |
03 | (DOES_JOB RENAME (Job AS Job_info)) AS T3, |
04 | (EXTEND SALARY_UNK ADD ‘Salary unknown’ AS Sal_info) AS T4, |
05 | (EXTEND UNSALARIED ADD ‘Unsalaried’ AS Sal_info) AS T5, |
06 | (EXTEND EARNS ADD CHAR (Salary) AS Sal_info) AS T6, |
07 | (T6 { ALL BUT Salary }) AS T7, |
08 | ( UNION ( T1, T2, T3 )) AS T8, |
09 | ( UNION ( T4, T5, T7 )) AS T9, |
10 | ( JOIN ( CALLED, T8, T9 )) AS PERS_INFO : |
11 | PERS_INFO |
01 | SQL> with t1 as |
02 | 2 ( select id |
03 | 3 , 'Job unknown' as job_info |
04 | 4 from job_unk |
05 | 5 ) |
06 | 6 , t2 as |
07 | 7 ( select id |
08 | 8 , 'Unemployed' as job_info |
09 | 9 from unemployed |
10 | 10 ) |
11 | 11 , t3 as |
12 | 12 ( select id |
13 | 13 , job as job_info |
14 | 14 from does_job |
15 | 15 ) |
16 | 16 , t4 as |
17 | 17 ( select id |
18 | 18 , 'Salary unknown' as sal_info |
19 | 19 from salary_unk |
20 | 20 ) |
21 | 21 , t5 as |
22 | 22 ( select id |
23 | 23 , 'Unsalaried' as sal_info |
24 | 24 from unsalaried |
25 | 25 ) |
26 | 26 , t6 as |
27 | 27 ( select id |
28 | 28 , salary |
29 | 29 , to_char(salary, 'fm999G999' ) as sal_info |
30 | 30 from earns |
31 | 31 ) |
32 | 32 , t7 as |
33 | 33 ( select id |
34 | 34 , sal_info |
35 | 35 from t6 |
36 | 36 ) |
37 | 37 , t8 as |
38 | 38 ( select id |
39 | 39 , job_info |
40 | 40 from t1 |
41 | 41 union all |
42 | 42 select id |
43 | 43 , job_info |
44 | 44 from t2 |
45 | 45 union all |
46 | 46 select id |
47 | 47 , job_info |
48 | 48 from t3 |
49 | 49 ) |
50 | 50 , t9 as |
51 | 51 ( select id |
52 | 52 , sal_info |
53 | 53 from t4 |
54 | 54 union all |
55 | 55 select id |
56 | 56 , sal_info |
57 | 57 from t5 |
58 | 58 union all |
59 | 59 select id |
60 | 60 , sal_info |
61 | 61 from t7 |
62 | 62 ) |
63 | 63 , pers_info as |
64 | 64 ( select c.id |
65 | 65 , c. name |
66 | 66 , j.job_info |
67 | 67 , s.sal_info |
68 | 68 from called c |
69 | 69 inner join t8 j on (c.id = j.id) |
70 | 70 inner join t9 s on (c.id = s.id) |
71 | 71 ) |
72 | 72 select * |
73 | 73 from pers_info |
74 | 74 / |
75 |
76 | ID NAME JOB_INFO SAL_INFO |
77 | ---------- -------- ----------- -------------- |
78 | 1235 Boris Banker Salary unknown |
79 | 1237 Davinder Unemployed Unsalaried |
80 | 1234 Anne Lawyer 100,000 |
81 | 1236 Cindy Job unknown 70,000 |
82 |
83 | 4 rows selected. |
01 | SQL> select * |
02 | 2 from table (dbms_xplan.display_cursor( null , null , 'allstats last' )) |
03 | 3 / |
04 |
05 | PLAN_TABLE_OUTPUT |
06 | --------------------------------------------------------------------------------------------------------------------------------------- |
07 | SQL_ID bmrtdy0jad18p, child number 0 |
08 | ------------------------------------- |
09 | with t1 as ( select id , 'Job unknown' as job_info from |
10 | job_unk ) , t2 as ( select id , 'Unemployed' as job_info |
11 | from unemployed ) , t3 as ( select id , job as job_info from |
12 | does_job ) , t4 as ( select id , 'Salary unknown' as sal_info |
13 | from salary_unk ) , t5 as ( select id , 'Unsalaried' as |
14 | sal_info from unsalaried ) , t6 as ( select id , salary |
15 | , to_char(salary, 'fm999G999' ) as sal_info from earns ) , t7 as ( |
16 | select id , sal_info from t6 ) , t8 as ( select id , |
17 | job_info from t1 union all select id , job_info |
18 | from t2 union all select id , job_info from t3 ) , t9 |
19 | as ( select id , sal_info from t4 union all select id |
20 | , sal_info from t5 union all select id , sal_info |
21 | from t7 ) , pers_info as ( select c.id , c. name , |
22 | j.job_info , s.sal_info from called c inner join t8 |
23 | j on (c.id = j.id) |
24 |
25 | Plan hash value: 583520090 |
26 |
27 | ------------------------------------------------------------------------------------------------------------------------- |
28 | | Id | Operation | Name | Starts | E- Rows | A- Rows | A- Time | Buffers | OMem | 1Mem | Used-Mem | |
29 | ------------------------------------------------------------------------------------------------------------------------- |
30 | | 0 | SELECT STATEMENT | | 1 | | 4 |00:00:00.01 | 14 | | | | |
31 | |* 1 | HASH JOIN | | 1 | 4 | 4 |00:00:00.01 | 14 | 1011K| 1011K| 550K (0)| |
32 | |* 2 | HASH JOIN | | 1 | 4 | 4 |00:00:00.01 | 8 | 1180K| 1180K| 548K (0)| |
33 | | 3 | TABLE ACCESS FULL | CALLED | 1 | 4 | 4 |00:00:00.01 | 2 | | | | |
34 | | 4 | VIEW | | 1 | 4 | 4 |00:00:00.01 | 6 | | | | |
35 | | 5 | UNION - ALL | | 1 | | 4 |00:00:00.01 | 6 | | | | |
36 | | 6 | TABLE ACCESS FULL | JOB_UNK | 1 | 1 | 1 |00:00:00.01 | 2 | | | | |
37 | | 7 | TABLE ACCESS FULL | UNEMPLOYED | 1 | 1 | 1 |00:00:00.01 | 2 | | | | |
38 | | 8 | TABLE ACCESS FULL | DOES_JOB | 1 | 2 | 2 |00:00:00.01 | 2 | | | | |
39 | | 9 | VIEW | | 1 | 4 | 4 |00:00:00.01 | 6 | | | | |
40 | | 10 | UNION - ALL | | 1 | | 4 |00:00:00.01 | 6 | | | | |
41 | | 11 | TABLE ACCESS FULL | SALARY_UNK | 1 | 1 | 1 |00:00:00.01 | 2 | | | | |
42 | | 12 | TABLE ACCESS FULL | UNSALARIED | 1 | 1 | 1 |00:00:00.01 | 2 | | | | |
43 | | 13 | TABLE ACCESS FULL | EARNS | 1 | 2 | 2 |00:00:00.01 | 2 | | | | |
44 | ------------------------------------------------------------------------------------------------------------------------- |
45 |
46 | Predicate Information (identified by operation id): |
47 | --------------------------------------------------- |
48 |
49 | 1 - access( "C" . "ID" = "S" . "ID" ) |
50 | 2 - access( "C" . "ID" = "J" . "ID" ) |
51 |
52 |
53 | 45 rows selected. |
01 | SQL> select c.id |
02 | 2 , c. name |
03 | 3 , coalesce (j.job,nvl2(ju.id, 'Job unknown' , null ),nvl2(ue.id, 'Unemployed' , null )) job_info |
04 | 4 , coalesce (to_char(e.salary, 'fm999G999' ),nvl2(su.id, 'Salary unknown' , null ),nvl2(us.id, 'Unsalaried' , null )) salary_info |
05 | 5 from called c |
06 | 6 left outer join does_job j on (c.id = j.id) |
07 | 7 left outer join job_unk ju on (c.id = ju.id) |
08 | 8 left outer join unemployed ue on (c.id = ue.id) |
09 | 9 left outer join earns e on (c.id = e.id) |
10 | 10 left outer join salary_unk su on (c.id = su.id) |
11 | 11 left outer join unsalaried us on (c.id = us.id) |
12 | 12 / |
13 |
14 | ID NAME JOB_INFO SALARY_INFO |
15 | ---------- -------- ----------- -------------- |
16 | 1234 Anne Lawyer 100,000 |
17 | 1236 Cindy Job unknown 70,000 |
18 | 1235 Boris Banker Salary unknown |
19 | 1237 Davinder Unemployed Unsalaried |
20 |
21 | 4 rows selected. |
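The nvl2 calls do the labelling here: nvl2(expr, a, b) returns a when expr is not null and b otherwise, so nvl2(ju.id, 'Job unknown', null) yields its label exactly when the outer join to JOB_UNK found a row, and coalesce then picks the first non-null label. SQLite has no nvl2, but the equivalent case expression shows the same pattern (minimal two-person data set of my own):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table called   (id integer primary key, name text);
create table does_job (id integer primary key, job text);
create table job_unk  (id integer primary key);
insert into called values (1234, 'Anne'), (1236, 'Cindy');
insert into does_job values (1234, 'Lawyer');
insert into job_unk values (1236);
""")
# nvl2(x, a, b) is equivalent to: case when x is not null then a else b end.
# coalesce returns its first non-null argument, so the real job wins, and
# the 'Job unknown' label only fires when a job_unk row matched.
rows = conn.execute("""
select c.name
     , coalesce(j.job,
                case when ju.id is not null then 'Job unknown' end) as job_info
  from called c
  left outer join does_job j on c.id = j.id
  left outer join job_unk ju on c.id = ju.id
 order by c.id
""").fetchall()
print(rows)  # [('Anne', 'Lawyer'), ('Cindy', 'Job unknown')]
```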
01 | SQL> select * |
02 | 2 from table (dbms_xplan.display_cursor( null , null , 'allstats last' )) |
03 | 3 / |
04 |
05 | PLAN_TABLE_OUTPUT |
06 | --------------------------------------------------------------------------------------------------------------------------------------- |
07 | SQL_ID 6x45b27mvpb1m, child number 0 |
08 | ------------------------------------- |
09 | select c.id , c. name , coalesce (j.job,nvl2(ju.id, 'Job |
10 | unknown' , null ),nvl2(ue.id, 'Unemployed' , null )) job_info , |
11 | coalesce (to_char(e.salary, 'fm999G999' ),nvl2(su.id, 'Salary |
12 | unknown' , null ),nvl2(us.id, 'Unsalaried' , null )) salary_info from called |
13 | c left outer join does_job j on (c.id = j.id) left outer |
14 | join job_unk ju on (c.id = ju.id) left outer join unemployed ue |
15 | on (c.id = ue.id) left outer join earns e on (c.id = e.id) |
16 | left outer join salary_unk su on (c.id = su.id) left outer join |
17 | unsalaried us on (c.id = us.id) |
18 |
19 | Plan hash value: 3398518218 |
20 |
21 | --------------------------------------------------------------------------------------------------------------------------- |
22 | | Id | Operation | Name | Starts | E- Rows | A- Rows | A- Time | Buffers | OMem | 1Mem | Used-Mem | |
23 | --------------------------------------------------------------------------------------------------------------------------- |
24 | | 0 | SELECT STATEMENT | | 1 | | 4 |00:00:00.01 | 15 | | | | |
25 | |* 1 | HASH JOIN OUTER | | 1 | 4 | 4 |00:00:00.01 | 15 | 955K| 955K| 528K (0)| |
26 | |* 2 | HASH JOIN OUTER | | 1 | 4 | 4 |00:00:00.01 | 12 | 1000K| 1000K| 523K (0)| |
27 | |* 3 | HASH JOIN OUTER | | 1 | 4 | 4 |00:00:00.01 | 10 | 1035K| 1035K| 536K (0)| |
28 | |* 4 | HASH JOIN OUTER | | 1 | 4 | 4 |00:00:00.01 | 8 | 1063K| 1063K| 536K (0)| |
29 | |* 5 | HASH JOIN OUTER | | 1 | 4 | 4 |00:00:00.01 | 6 | 1114K| 1114K| 537K (0)| |
30 | |* 6 | HASH JOIN OUTER | | 1 | 4 | 4 |00:00:00.01 | 4 | 1180K| 1180K| 538K (0)| |
31 | | 7 | TABLE ACCESS FULL | CALLED | 1 | 4 | 4 |00:00:00.01 | 2 | | | | |
32 | | 8 | TABLE ACCESS FULL | JOB_UNK | 1 | 1 | 1 |00:00:00.01 | 2 | | | | |
33 | | 9 | TABLE ACCESS FULL | UNEMPLOYED | 1 | 1 | 1 |00:00:00.01 | 2 | | | | |
34 | | 10 | TABLE ACCESS FULL | SALARY_UNK | 1 | 1 | 1 |00:00:00.01 | 2 | | | | |
35 | | 11 | TABLE ACCESS FULL | UNSALARIED | 1 | 1 | 1 |00:00:00.01 | 2 | | | | |
36 | | 12 | TABLE ACCESS FULL | DOES_JOB | 1 | 2 | 2 |00:00:00.01 | 2 | | | | |
37 | | 13 | TABLE ACCESS FULL | EARNS | 1 | 2 | 2 |00:00:00.01 | 3 | | | | |
38 | --------------------------------------------------------------------------------------------------------------------------- |
39 |
40 | Predicate Information (identified by operation id): |
41 | --------------------------------------------------- |
42 |
43 | 1 - access( "C" . "ID" = "E" . "ID" ) |
44 | 2 - access( "C" . "ID" = "J" . "ID" ) |
45 | 3 - access( "C" . "ID" = "US" . "ID" ) |
46 | 4 - access( "C" . "ID" = "SU" . "ID" ) |
47 | 5 - access( "C" . "ID" = "UE" . "ID" ) |
48 | 6 - access( "C" . "ID" = "JU" . "ID" ) |
49 |
50 |
51 | 43 rows selected. |
01 | SQL> select * |
02 | 2 from pers_info |
03 | 3 / |
04 |
05 | ID NAME JOB SALARY |
06 | ---------- ---------- ---------- ---------- |
07 | 1234 Anne Lawyer 100000 |
08 | 1235 Boris Banker |
09 | 1236 Cindy 70000 |
10 | 1237 Davinder |
11 |
12 | 4 rows selected. |
13 |
14 | SQL> select * |
15 | 2 from table (dbms_xplan.display_cursor( null , null , 'allstats last' )) |
16 | 3 / |
17 |
18 | PLAN_TABLE_OUTPUT |
19 | --------------------------------------------------------------------------------------------------------------------------------------- |
20 | SQL_ID 016x9f106gj27, child number 1 |
21 | ------------------------------------- |
22 | select * from pers_info |
23 |
24 | Plan hash value: 1584579034 |
25 |
26 | ----------------------------------------------------------------------------------------- |
27 | | Id | Operation | Name | Starts | E- Rows | A- Rows | A- Time | Buffers | |
28 | ----------------------------------------------------------------------------------------- |
29 | | 0 | SELECT STATEMENT | | 1 | | 4 |00:00:00.01 | 7 | |
30 | | 1 | TABLE ACCESS FULL | PERS_INFO | 1 | 4 | 4 |00:00:00.01 | 7 | |
31 | ----------------------------------------------------------------------------------------- |
32 |
33 |
34 | 13 rows selected. |
- No two CALLED rows have the same Id. (Primary key)
- Every row in CALLED has a matching row in either DOES_JOB, JOB_UNK, or UNEMPLOYED.
- No row in DOES_JOB has a matching row in JOB_UNK.
- No row in DOES_JOB has a matching row in UNEMPLOYED.
- No row in JOB_UNK has a matching row in UNEMPLOYED.
- Every row in DOES_JOB has a matching row in CALLED. (Foreign key)
- Every row in JOB_UNK has a matching row in CALLED. (Foreign key)
- Every row in UNEMPLOYED has a matching row in CALLED. (Foreign key)
- Constraints 2 through 8 repeated, mutatis mutandis, for CALLED with respect to EARNS, SALARY_UNK and UNSALARIED.
1 | SQL> alter table called add primary key (id) |
2 | 2 / |
3 |
4 | Table altered. |
01 | SQL> alter table does_job add foreign key (id) references called (id) |
02 | 2 / |
03 |
04 | Table altered. |
05 |
06 | SQL> alter table job_unk add foreign key (id) references called (id) |
07 | 2 / |
08 |
09 | Table altered. |
10 |
11 | SQL> alter table unemployed add foreign key (id) references called (id) |
12 | 2 / |
13 |
14 | Table altered. |
The distributed key (constraints 3, 4 and 5):
01 | SQL> create materialized view log on does_job with rowid |
02 | 2 / |
03 |
04 | Materialized view log created. |
05 |
06 | SQL> create materialized view log on job_unk with rowid |
07 | 2 / |
08 |
09 | Materialized view log created. |
10 |
11 | SQL> create materialized view log on unemployed with rowid |
12 | 2 / |
13 |
14 | Materialized view log created. |
15 |
16 | SQL> create materialized view distributed_key_vw |
17 | 2 refresh fast on commit |
18 | 3 as |
19 | 4 select d.rowid rid |
20 | 5 , d.id id |
21 | 6 , 'D' umarker |
22 | 7 from does_job d |
23 | 8 union all |
24 | 9 select j.rowid |
25 | 10 , j.id |
26 | 11 , 'J' |
27 | 12 from job_unk j |
28 | 13 union all |
29 | 14 select u.rowid |
30 | 15 , u.id |
31 | 16 , 'U' |
32 | 17 from unemployed u |
33 | 18 / |
34 |
35 | Materialized view created. |
36 |
37 | SQL> alter table distributed_key_vw |
38 | 2 add constraint distributed_key_check |
39 | 3 primary key (id) |
40 | 4 / |
41 |
42 | Table altered. |
01 | SQL> insert into job_unk values (1234) |
02 | 2 / |
03 |
04 | 1 row created. |
05 |
06 | SQL> commit |
07 | 2 / |
08 | commit |
09 | * |
10 | ERROR at line 1: |
11 | ORA-12048: error encountered while refreshing materialized view "RWIJK" . "DISTRIBUTED_KEY_VW" |
12 | ORA-00001: unique constraint (RWIJK.DISTRIBUTED_KEY_CHECK) violated |
01 | SQL> create materialized view log on does_job with rowid |
02 | 2 / |
03 |
04 | Materialized view log created. |
05 |
06 | SQL> create materialized view log on job_unk with rowid |
07 | 2 / |
08 |
09 | Materialized view log created. |
10 |
11 | SQL> create materialized view log on unemployed with rowid |
12 | 2 / |
13 |
14 | Materialized view log created. |
15 |
16 | SQL> create materialized view foreign_distributed_key_vw |
17 | 2 refresh fast on commit |
18 | 3 as |
19 | 4 select c.rowid c_rowid |
20 | 5 , dj.rowid dj_rowid |
21 | 6 , ju.rowid ju_rowid |
22 | 7 , ue.rowid ue_rowid |
23 | 8 , c.id id |
24 | 9 , dj.id dj_id |
25 | 10 , ju.id ju_id |
26 | 11 , ue.id ue_id |
27 | 12 from called c |
28 | 13 , does_job dj |
29 | 14 , job_unk ju |
30 | 15 , unemployed ue |
31 | 16 where c.id = dj.id (+) |
32 | 17 and c.id = ju.id (+) |
33 | 18 and c.id = ue.id (+) |
34 | 19 / |
35 |
36 | Materialized view created. |
37 |
38 | SQL> alter table foreign_distributed_key_vw |
39 | 2 add constraint foreign_distributed_key_check |
40 | 3 check ( coalesce (dj_id,ju_id,ue_id) is not null ) |
41 | 4 / |
42 |
43 | Table altered. |
01 | SQL> insert into called values (1238, 'Elise' ) |
02 | 2 / |
03 |
04 | 1 row created. |
05 |
06 | SQL> commit |
07 | 2 / |
08 | commit |
09 | * |
10 | ERROR at line 1: |
11 | ORA-12008: error in materialized view refresh path |
12 | ORA-02290: check constraint (RWIJK.FOREIGN_DISTRIBUTED_KEY_CHECK) violated |
Connect By Filtering
A hierarchical query is typically executed using a plan that starts with the operation CONNECT BY WITH FILTERING, which has two child operations. The first child operation implements the START WITH clause and the second child operation contains a step called CONNECT BY PUMP, implementing the recursive part of the query. Here is an example of such a plan, using the well-known hierarchical query on table EMP:
01 | SQL> select lpad( ' ' , 2 * level - 2, ' ' ) || ename as ename |
02 | 2 , level |
03 | 3 , job |
04 | 4 , deptno |
05 | 5 from emp |
06 | 6 connect by mgr = prior empno |
07 | 7 start with mgr is null |
08 | 8 / |
09 |
10 | ENAME LEVEL JOB DEPTNO |
11 | -------------------- ---------- --------------------------- ---------- |
12 | KING 1 PRESIDENT 10 |
13 | JONES 2 MANAGER 20 |
14 | SCOTT 3 ANALYST 20 |
15 | ADAMS 4 CLERK 20 |
16 | FORD 3 ANALYST 20 |
17 | SMITH 4 CLERK 20 |
18 | BLAKE 2 MANAGER 30 |
19 | ALLEN 3 SALESMAN 30 |
20 | WARD 3 SALESMAN 30 |
21 | MARTIN 3 SALESMAN 30 |
22 | TURNER 3 SALESMAN 30 |
23 | JAMES 3 CLERK 30 |
24 | CLARK 2 MANAGER 10 |
25 | MILLER 3 CLERK 10 |
26 |
27 | 14 rows selected. |
28 |
29 | SQL> select * from table (dbms_xplan.display_cursor( null , null , 'allstats last' )) |
30 | 2 / |
31 |
32 | PLAN_TABLE_OUTPUT |
33 | --------------------------------------------------------------------------------------------------------------------------------------- |
34 | SQL_ID d2c7xqxbr112u, child number 0 |
35 | ------------------------------------- |
36 | select lpad( ' ' , 2 * level - 2, ' ' ) || ename as ename , level , job , deptno from emp connect by |
37 | mgr = prior empno start with mgr is null |
38 |
39 | Plan hash value: 1869448388 |
40 |
41 | -------------------------------------------------------------------------------------------------------------------------------- |
42 | | Id | Operation | Name | Starts | E- Rows | A- Rows | A- Time | Buffers | Reads | OMem | 1Mem | Used-Mem | |
43 | -------------------------------------------------------------------------------------------------------------------------------- |
44 | |* 1 | CONNECT BY WITH FILTERING| | 1 | | 14 |00:00:00.02 | 15 | 6 | 2048 | 2048 | 2048 (0)| |
45 | |* 2 | TABLE ACCESS FULL | EMP | 1 | 1 | 1 |00:00:00.01 | 3 | 6 | | | | |
46 | |* 3 | HASH JOIN | | 4 | | 13 |00:00:00.01 | 12 | 0 | 1452K| 1452K| 853K (0)| |
47 | | 4 | CONNECT BY PUMP | | 4 | | 14 |00:00:00.01 | 0 | 0 | | | | |
48 | | 5 | TABLE ACCESS FULL | EMP | 4 | 2 | 56 |00:00:00.01 | 12 | 0 | | | | |
49 | -------------------------------------------------------------------------------------------------------------------------------- |
50 |
51 | Predicate Information (identified by operation id): |
52 | --------------------------------------------------- |
53 |
54 | 1 - access( "MGR" = PRIOR NULL ) |
55 | 2 - filter( "MGR" IS NULL ) |
56 | 3 - access( "MGR" = PRIOR NULL ) |
57 |
58 |
59 | 24 rows selected. |
You can find a great and more detailed explanation of connect by with filtering on Christian Antognini's blog.
When I was researching the new recursive subquery factoring clause a year and a half ago, and compared a standard hierarchical query on EMP using recursive subquery factoring with one using the good old connect by, I stumbled upon a new optimizer algorithm for implementing recursive queries:
01 | SQL> select lpad( ' ' , 2 * level - 2, ' ' ) || ename as ename |
02 | 2 , level |
03 | 3 , job |
04 | 4 , deptno |
05 | 5 from emp |
06 | 6 connect by mgr = prior empno |
07 | 7 start with mgr is null |
08 | 8 / |
09 |
10 | ENAME LEVEL JOB DEPTNO |
11 | -------------------- ---------- --------- ---------- |
12 | KING 1 PRESIDENT 10 |
13 | JONES 2 MANAGER 20 |
14 | SCOTT 3 ANALYST 20 |
15 | ADAMS 4 CLERK 20 |
16 | FORD 3 ANALYST 20 |
17 | SMITH 4 CLERK 20 |
18 | BLAKE 2 MANAGER 30 |
19 | ALLEN 3 SALESMAN 30 |
20 | WARD 3 SALESMAN 30 |
21 | MARTIN 3 SALESMAN 30 |
22 | TURNER 3 SALESMAN 30 |
23 | JAMES 3 CLERK 30 |
24 | CLARK 2 MANAGER 10 |
25 | MILLER 3 CLERK 10 |
26 |
27 | 14 rows selected. |
28 |
29 | SQL> select * from table (dbms_xplan.display_cursor( null , null , 'allstats last' )) |
30 | 2 / |
31 |
32 | PLAN_TABLE_OUTPUT |
33 | --------------------------------------------------------------------------------------------------------------------------------------- |
34 | SQL_ID d2c7xqxbr112u, child number 0 |
35 | ------------------------------------- |
36 | select lpad( ' ' , 2 * level - 2, ' ' ) || ename as ename , level |
37 | , job , deptno from emp connect by mgr = prior empno |
38 | start with mgr is null |
39 |
40 | Plan hash value: 763482334 |
41 |
42 | ------------------------------------------------------------------------------------------------------------------- |
43 | | Id | Operation | Name | Starts | E- Rows | A- Rows | A- Time | Buffers | Reads | |
44 | ------------------------------------------------------------------------------------------------------------------- |
45 | | 0 | SELECT STATEMENT | | 1 | | 14 |00:00:00.02 | 6 | 6 | |
46 | |* 1 | CONNECT BY NO FILTERING WITH START- WITH | | 1 | | 14 |00:00:00.02 | 6 | 6 | |
47 | | 2 | TABLE ACCESS FULL | EMP | 1 | 14 | 14 |00:00:00.02 | 6 | 6 | |
48 | ------------------------------------------------------------------------------------------------------------------- |
49 |
50 | Predicate Information (identified by operation id): |
51 | --------------------------------------------------- |
52 |
53 | 1 - access( "MGR" = PRIOR NULL ) |
54 | filter( "MGR" IS NULL ) |
55 |
56 |
57 | 22 rows selected. |
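As an aside, the recursive subquery factoring clause mentioned above expresses the same traversal without connect by: a START WITH branch, a union all, and a recursive join. A rough sketch of the equivalent as a recursive CTE (shown in SQLite via Python; the miniature emp table is mine, and in Oracle you would add a search depth first by clause to reproduce the depth-first ordering):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table emp (empno integer primary key, ename text, mgr integer);
insert into emp values (7839, 'KING', null), (7566, 'JONES', 7839),
                       (7698, 'BLAKE', 7839), (7788, 'SCOTT', 7566);
""")
rows = conn.execute("""
with recursive tree(empno, ename, lvl) as (
  select empno, ename, 1 from emp where mgr is null   -- the START WITH part
  union all
  select e.empno, e.ename, t.lvl + 1                  -- the CONNECT BY part
    from emp e
    join tree t on e.mgr = t.empno)
select ename, lvl from tree order by lvl, ename
""").fetchall()
print(rows)  # [('KING', 1), ('BLAKE', 2), ('JONES', 2), ('SCOTT', 3)]
```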
You might wonder what I did to make two identical queries use different execution plans, but I'll address that later. First, I'd like to show that there are two optimizer hints available with which you can control which algorithm the optimizer uses:
01 | SQL> select * |
02 | 2 from v$sql_hint |
03 | 3 where name like '%CONNECT_BY_FILTERING%' |
04 | 4 / |
05 |
06 | NAME SQL_FEATURE CLASS |
07 | ----------------------- ------------ ----------------------- |
08 | INVERSE TARGET_LEVEL PROPERTY VERSION VERSION_OUTLINE |
09 | ----------------------- ------------ ---------- ---------- --------------- |
10 | CONNECT_BY_FILTERING QKSFM_ALL CONNECT_BY_FILTERING |
11 | NO_CONNECT_BY_FILTERING 2 16 10.2.0.2 10.2.0.2 |
12 |
13 | NO_CONNECT_BY_FILTERING QKSFM_ALL CONNECT_BY_FILTERING |
14 | CONNECT_BY_FILTERING 2 16 10.2.0.2 10.2.0.2 |
15 |
16 |
17 | 2 rows selected. |
And this was surprising to me. As the version column suggests, the no_connect_by_filtering hint and the accompanying new algorithm were already introduced in version 10.2.0.2! I checked with my old 10.2.0.4 database and it is indeed present and can be used there:
01 | SQL> select version |
02 | 2 from v$instance |
03 | 3 / |
04 |
05 | VERSION |
06 | --------------------------------------------------- |
07 | 10.2.0.4.0 |
08 |
09 | 1 row selected. |
10 |
11 | SQL> select /*+ no_connect_by_filtering gather_plan_statistics */ |
12 | 2 lpad( ' ' , 2 * level - 2, ' ' ) || ename as ename |
13 | 3 , level |
14 | 4 , job |
15 | 5 , deptno |
16 | 6 from emp |
17 | 7 connect by mgr = prior empno |
18 | 8 start with mgr is null |
19 | 9 / |
20 |
21 | ENAME LEVEL JOB DEPTNO |
22 | -------------------- ---------- --------------------------- ---------- |
23 | KING 1 PRESIDENT 10 |
24 | JONES 2 MANAGER 20 |
25 | SCOTT 3 ANALYST 20 |
26 | ADAMS 4 CLERK 20 |
27 | FORD 3 ANALYST 20 |
28 | SMITH 4 CLERK 20 |
29 | BLAKE 2 MANAGER 30 |
30 | ALLEN 3 SALESMAN 30 |
31 | WARD 3 SALESMAN 30 |
32 | MARTIN 3 SALESMAN 30 |
33 | TURNER 3 SALESMAN 30 |
34 | JAMES 3 CLERK 30 |
35 | CLARK 2 MANAGER 10 |
36 | MILLER 3 CLERK 10 |
37 |
38 | 14 rows selected. |
39 |
40 | SQL> select * from table (dbms_xplan.display_cursor( null , null , 'allstats last' )) |
41 | 2 / |
42 |
43 | PLAN_TABLE_OUTPUT |
44 | --------------------------------------------------------------------------------------------------------------------------------------- |
45 | SQL_ID 39kr5s8dxz7j0, child number 0 |
46 | ------------------------------------- |
47 | select /*+ no_connect_by_filtering gather_plan_statistics */ lpad( ' ' , 2 * level - 2, ' |
48 | ' ) || ename as ename , level , job , deptno from emp connect by mgr = prior |
49 | empno start with mgr is null |
50 |
51 | Plan hash value: 763482334 |
52 |
53 | ---------------------------------------------------------------------------------------------------------- |
54 | | Id | Operation | Name | Starts | E- Rows | A- Rows | A- Time | Buffers | |
55 | ---------------------------------------------------------------------------------------------------------- |
56 | |* 1 | CONNECT BY NO FILTERING WITH START- WITH | | 1 | | 14 |00:00:00.01 | 3 | |
57 | | 2 | TABLE ACCESS FULL | EMP | 1 | 14 | 14 |00:00:00.01 | 3 | |
58 | ---------------------------------------------------------------------------------------------------------- |
59 |
60 | Predicate Information (identified by operation id): |
61 | --------------------------------------------------- |
62 |
63 | 1 - access( "MGR" = PRIOR NULL ) |
64 | filter( "MGR" IS NULL ) |
65 |
66 |
67 | 21 rows selected. |
But you need the no_connect_by_filtering hint in version 10.2.0.4 for this query. If you do not provide the hint, this is the result:
01 | SQL> select /*+ gather_plan_statistics */ |
02 | 2 lpad( ' ' , 2 * level - 2, ' ' ) || ename as ename |
03 | 3 , level |
04 | 4 , job |
05 | 5 , deptno |
06 | 6 from emp |
07 | 7 connect by mgr = prior empno |
08 | 8 start with mgr is null |
09 | 9 / |
10 |
11 | ENAME LEVEL JOB DEPTNO |
12 | -------------------- ---------- --------------------------- ---------- |
13 | KING 1 PRESIDENT 10 |
14 | JONES 2 MANAGER 20 |
15 | SCOTT 3 ANALYST 20 |
16 | ADAMS 4 CLERK 20 |
17 | FORD 3 ANALYST 20 |
18 | SMITH 4 CLERK 20 |
19 | BLAKE 2 MANAGER 30 |
20 | ALLEN 3 SALESMAN 30 |
21 | WARD 3 SALESMAN 30 |
22 | MARTIN 3 SALESMAN 30 |
23 | TURNER 3 SALESMAN 30 |
24 | JAMES 3 CLERK 30 |
25 | CLARK 2 MANAGER 10 |
26 | MILLER 3 CLERK 10 |
27 |
28 | 14 rows selected. |
29 |
30 | SQL> select * from table (dbms_xplan.display_cursor( null , null , 'allstats last' )) |
31 | 2 / |
32 |
33 | PLAN_TABLE_OUTPUT |
34 | --------------------------------------------------------------------------------------------------------------------------------------- |
35 | SQL_ID 6zhtnf720u0bm, child number 0 |
36 | ------------------------------------- |
37 | select /*+ gather_plan_statistics */ lpad( ' ' , 2 * level - 2, ' ' ) || ename as ename , level |
38 | , job , deptno from emp connect by mgr = prior empno start with mgr is null |
39 |
40 | Plan hash value: 1869448388 |
41 |
42 | ----------------------------------------------------------------------------------------------------------------------- |
43 | | Id | Operation | Name | Starts | E- Rows | A- Rows | A- Time | Buffers | OMem | 1Mem | Used-Mem | |
44 | ----------------------------------------------------------------------------------------------------------------------- |
45 | |* 1 | CONNECT BY WITH FILTERING| | 1 | | 14 |00:00:00.01 | 15 | 2048 | 2048 | 2048 (0)| |
46 | |* 2 | TABLE ACCESS FULL | EMP | 1 | 1 | 1 |00:00:00.01 | 3 | | | | |
47 | |* 3 | HASH JOIN | | 4 | | 13 |00:00:00.01 | 12 | 1452K| 1452K| 843K (0)| |
48 | | 4 | CONNECT BY PUMP | | 4 | | 14 |00:00:00.01 | 0 | | | | |
49 | | 5 | TABLE ACCESS FULL | EMP | 4 | 2 | 56 |00:00:00.01 | 12 | | | | |
50 | ----------------------------------------------------------------------------------------------------------------------- |
51 |
52 | Predicate Information (identified by operation id): |
53 | --------------------------------------------------- |
54 |
55 | 1 - access( "MGR" = PRIOR NULL ) |
56 | 2 - filter( "MGR" IS NULL ) |
57 | 3 - access( "MGR" = PRIOR NULL ) |
58 |
59 |
60 | 24 rows selected. |
Which explains why I didn't see the CONNECT BY NO FILTERING WITH START-WITH operation earlier. It seems that Oracle adjusted the cost calculation of connect by queries somewhere between 10.2.0.4 and 11.2.0.2. Just look at the cost in both execution plans on 10.2.0.4, using a regular explain plan statement and a "select * from table(dbms_xplan.display)":
01 | ---------------------------------------------------------------------------------- |
02 | | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | |
03 | ---------------------------------------------------------------------------------- |
04 | | 0 | SELECT STATEMENT | | 2 | 50 | 3 (0)| 00:00:01 | |
05 | |* 1 | CONNECT BY WITH FILTERING| | | | | | |
06 | |* 2 | TABLE ACCESS FULL | EMP | 1 | 29 | 3 (0)| 00:00:01 | |
07 | |* 3 | HASH JOIN | | | | | | |
08 | | 4 | CONNECT BY PUMP | | | | | | |
09 | | 5 | TABLE ACCESS FULL | EMP | 2 | 50 | 3 (0)| 00:00:01 | |
10 | ---------------------------------------------------------------------------------- |
1 | ------------------------------------------------------------------------------------------------ |
2 | | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | |
3 | ------------------------------------------------------------------------------------------------ |
4 | | 0 | SELECT STATEMENT | | 14 | 350 | 3 (0)| 00:00:01 | |
5 | |* 1 | CONNECT BY NO FILTERING WITH START- WITH | | | | | | |
6 | | 2 | TABLE ACCESS FULL | EMP | 14 | 350 | 3 (0)| 00:00:01 | |
7 | ------------------------------------------------------------------------------------------------ |
The cost of 3 is due to the full table scan of EMP, and no additional cost is added for the hierarchical query.
These are the plans from 11.2.0.2:
01 | ---------------------------------------------------------------------------------- |
02 | | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | |
03 | ---------------------------------------------------------------------------------- |
04 | | 0 | SELECT STATEMENT | | 3 | 156 | 15 (20)| 00:00:01 | |
05 | |* 1 | CONNECT BY WITH FILTERING| | | | | | |
06 | |* 2 | TABLE ACCESS FULL | EMP | 1 | 25 | 4 (0)| 00:00:01 | |
07 | |* 3 | HASH JOIN | | 2 | 76 | 9 (12)| 00:00:01 | |
08 | | 4 | CONNECT BY PUMP | | | | | | |
09 | |* 5 | TABLE ACCESS FULL | EMP | 13 | 325 | 4 (0)| 00:00:01 | |
10 | ---------------------------------------------------------------------------------- |
1 | ------------------------------------------------------------------------------------------------ |
2 | | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | |
3 | ------------------------------------------------------------------------------------------------ |
4 | | 0 | SELECT STATEMENT | | 14 | 728 | 5 (20)| 00:00:01 | |
5 | |* 1 | CONNECT BY NO FILTERING WITH START- WITH | | | | | | |
6 | | 2 | TABLE ACCESS FULL | EMP | 14 | 350 | 4 (0)| 00:00:01 | |
7 | ------------------------------------------------------------------------------------------------ |
The numbers from 11.2.0.2 show more sophistication than just the cost of the table scan. The optimizer can't know how many levels deep the data is, but version 10.2.0.4 apparently assumed just one level and left the total cost unchanged at 3. I'm curious in which version between 10.2.0.4 and 11.2.0.2 this cost calculation changed. If anyone reading this has a version in between and would like to check, please let me know in the comments. My guess would be that 11.2.0.1 contained the cost change.
What does CONNECT BY NO FILTERING WITH START-WITH do?
Let's explore this, using this table:
01 | SQL> create table t (id, parent_id, value, indicator) |
02 | 2 as |
03 | 3 select level - 1 |
04 | 4 , case level when 1 then null else trunc(( level -1)/10) end |
05 | 5 , round(dbms_random.value * 1000) |
06 | 6 , case mod( level ,10) when 4 then 'N' else 'Y' end |
07 | 7 from dual |
08 | 8 connect by level <= 100000 |
09 | 9 / |
10 |
11 | Table created. |
12 |
13 | SQL> alter table t |
14 | 2 add constraint cbt_pk |
15 | 3 primary key (id) |
16 | 4 / |
17 |
18 | Table altered. |
19 |
20 | SQL> create index i1 on t (parent_id,indicator) |
21 | 2 / |
22 |
23 | Index created. |
24 |
25 | SQL> exec dbms_stats.gather_table_stats( user , 't' , cascade => true ) |
The data is tree shaped, where each parent node has ten child nodes (the root node has nine, with ids 1 through 9). One tenth of the rows, those with an id ending in the digit 3, have their indicator column set to 'N'. This select query makes it clearer what the data looks like:
SQL> select *
  2  from   t
  3  where  id < 24 or id > 99997
  4  order  by id
  5  /

        ID  PARENT_ID      VALUE I
---------- ---------- ---------- -
         0                   656 Y
         1          0        289 Y
         2          0        365 Y
         3          0        644 N
         4          0        364 Y
         5          0        841 Y
         6          0        275 Y
         7          0        529 Y
         8          0        500 Y
         9          0        422 Y
        10          1        598 Y
        11          1        104 Y
        12          1        467 Y
        13          1        296 N
        14          1        105 Y
        15          1        220 Y
        16          1        692 Y
        17          1        793 Y
        18          1         29 Y
        19          1        304 Y
        20          2        467 Y
        21          2        716 Y
        22          2        837 Y
        23          2        432 N
     99998       9999        609 Y
     99999       9999         24 Y

26 rows selected.
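The rules behind this data can be mirrored in a small Python sketch (my own illustration, not part of the original scripts): id = level - 1, the parent is trunc(id/10) (null for the root), and the indicator is 'N' exactly when the id ends in 3.

```python
# Mirror of the SQL row generator above. The random value column is omitted,
# since only the tree shape and the indicator matter for what follows.
def row(level):
    id_ = level - 1
    parent = None if level == 1 else (level - 1) // 10   # trunc((level-1)/10)
    indicator = 'N' if level % 10 == 4 else 'Y'          # mod(level,10) = 4
    return id_, parent, indicator

# Same 100,000 rows as "connect by level <= 100000"
rows = [row(level) for level in range(1, 100001)]
```

Spot-checking against the listing above: row 3 has parent 0 and indicator 'N', row 33 has parent 3 and indicator 'N', and exactly one tenth of all rows carry 'N'.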
When hearing the word "filter", I almost immediately associate it with a WHERE clause. But a WHERE clause in a connect by query is not what is meant by connect by filtering. The documentation states:
Oracle processes hierarchical queries as follows:
1. A join, if present, is evaluated first, whether the join is specified in the FROM clause or with WHERE clause predicates.
2. The CONNECT BY condition is evaluated.
3. Any remaining WHERE clause predicates are evaluated.
So a where clause predicate is evaluated AFTER the connect by has done its job. You can see that happening here:
SQL> explain plan
  2  for
  3  select id
  4  ,      parent_id
  5  ,      sys_connect_by_path(id,'->') scbp
  6  from   t
  7  where  indicator = 'N'
  8  connect by parent_id = prior id
  9  start with parent_id is null
 10  /

Explained.

SQL> select * from table(dbms_xplan.display)
  2  /

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 2502271019

---------------------------------------------------------------------------------------
| Id  | Operation                      | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |      |    11 |   319 |   164   (3)| 00:00:02 |
|*  1 |  FILTER                        |      |       |       |            |          |
|*  2 |   CONNECT BY WITH FILTERING    |      |       |       |            |          |
|*  3 |    TABLE ACCESS FULL           | T    |     1 |    11 |    80   (2)| 00:00:01 |
|   4 |    NESTED LOOPS                |      |    10 |   240 |    82   (2)| 00:00:01 |
|   5 |     CONNECT BY PUMP            |      |       |       |            |          |
|   6 |     TABLE ACCESS BY INDEX ROWID| T    |    10 |   110 |     2   (0)| 00:00:01 |
|*  7 |      INDEX RANGE SCAN          | I1   |    10 |       |     1   (0)| 00:00:01 |
---------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("INDICATOR"='N')
   2 - access("PARENT_ID"=PRIOR "ID")
   3 - filter("PARENT_ID" IS NULL)
   7 - access("PARENT_ID"="connect$_by$_pump$_002"."prior id ")

22 rows selected.
The "indicator = 'N'" predicate sits at step 1, which is executed after the CONNECT BY WITH FILTERING at step 2. Note that although this query was executed on 11.2.0.2, the optimizer has chosen the old CONNECT BY WITH FILTERING.
Connect by filtering is done by using filters in your CONNECT BY clause. Here is an example using the predicate "indicator = 'N'" inside the CONNECT BY clause:
SQL> select id
  2  ,      parent_id
  3  ,      sys_connect_by_path(id, '->') scbp
  4  from   t
  5  connect by parent_id = prior id
  6  and     indicator = 'N'
  7  start with parent_id is null
  8  /

        ID  PARENT_ID SCBP
---------- ---------- --------------------------------------------------
         0            ->0
         3          0 ->0->3
        33          3 ->0->3->33
       333         33 ->0->3->33->333
      3333        333 ->0->3->33->333->3333
     33333       3333 ->0->3->33->333->3333->33333

6 rows selected.

SQL> select * from table(dbms_xplan.display_cursor(null, null, 'allstats last'))
  2  /

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------------------------------------
SQL_ID  dzkjzrrzgnvd5, child number 0
-------------------------------------
select id , parent_id , sys_connect_by_path(id, '->') scbp
from t connect by parent_id = prior id and indicator = 'N'
start with parent_id is null

Plan hash value: 3164577763

---------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
---------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |      |      1 |        |      6 |00:00:00.01 |     294 |       |       |          |
|*  1 |  CONNECT BY WITH FILTERING     |      |      1 |        |      6 |00:00:00.01 |     294 |  2048 |  2048 | 2048  (0)|
|*  2 |   TABLE ACCESS FULL            | T    |      1 |      1 |      1 |00:00:00.01 |     277 |       |       |          |
|   3 |   NESTED LOOPS                 |      |      6 |      5 |      5 |00:00:00.01 |      17 |       |       |          |
|   4 |    CONNECT BY PUMP             |      |      6 |        |      6 |00:00:00.01 |       0 |       |       |          |
|   5 |    TABLE ACCESS BY INDEX ROWID | T    |      6 |      5 |      5 |00:00:00.01 |      17 |       |       |          |
|*  6 |     INDEX RANGE SCAN           | I1   |      6 |      5 |      5 |00:00:00.01 |      12 |       |       |          |
---------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("PARENT_ID"=PRIOR NULL)
   2 - filter("PARENT_ID" IS NULL)
   6 - access("PARENT_ID"="connect$_by$_pump$_002"."prior id " AND "INDICATOR"='N')

27 rows selected.
In the A-Rows column, you can see that the connect by filtering was effective here: only the necessary rows were read. And this is the key difference between the two connect by algorithms. With CONNECT BY WITH FILTERING, you can filter within each recursion step, whereas CONNECT BY NO FILTERING WITH START-WITH has to read everything, perform an in-memory operation, and return the result. With this example, the latter is much less efficient:
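The contrast between the two strategies can be sketched in Python (a toy model of my own, not Oracle's actual implementation). Both walks produce the same six rows, but the pruning walk touches only a handful of rows, while the no-filtering variant must first read all 100K:

```python
# Tree mimicking table t: children of node n are 10n..10n+9 (capped at 99999),
# indicator is 'N' when the id ends in 3.
def children(node):
    return [c for c in range(node * 10, node * 10 + 10) if 0 < c <= 99999]

def indicator(node):
    return 'N' if node % 10 == 3 else 'Y'

def with_filtering(root):
    """Apply the predicate during recursion: prune non-'N' subtrees at once."""
    result, rows_read = [root], 0        # the start-with row is never filtered
    frontier = [root]
    while frontier:
        nxt = []
        for node in frontier:
            for c in children(node):
                rows_read += 1           # one indexed row fetched per candidate
                if indicator(c) == 'N':
                    result.append(c)
                    nxt.append(c)
        frontier = nxt
    return result, rows_read

def no_filtering(root):
    """Read every row first, then walk the materialized table in memory."""
    table = {n: indicator(n) for n in range(100000)}   # full scan: 100K rows
    rows_read = len(table)
    result, frontier = [root], [root]
    while frontier:
        frontier = [c for node in frontier for c in children(node)
                    if table[c] == 'N']
        result.extend(frontier)
    return result, rows_read

filtered_rows, filtered_reads = with_filtering(0)
unfiltered_rows, unfiltered_reads = no_filtering(0)
```

Both return ids 0, 3, 33, 333, 3333, 33333, but the filtering walk inspects only the ten candidate children per visited node, while the other pays for the full table up front, which is what the A-Rows column of the next plan shows.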
SQL> select /*+ no_connect_by_filtering */ id
  2  ,      parent_id
  3  ,      sys_connect_by_path(id, '->') scbp
  4  from   t
  5  connect by parent_id = prior id
  6  and     indicator = 'N'
  7  start with parent_id is null
  8  /

        ID  PARENT_ID SCBP
---------- ---------- --------------------------------------------------
         0            ->0
         3          0 ->0->3
        33          3 ->0->3->33
       333         33 ->0->3->33->333
      3333        333 ->0->3->33->333->3333
     33333       3333 ->0->3->33->333->3333->33333

6 rows selected.

SQL> select * from table(dbms_xplan.display_cursor(null, null, 'allstats last'))
  2  /

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------------------------------------
SQL_ID  3fcr31tp83by9, child number 0
-------------------------------------
select /*+ no_connect_by_filtering */ id , parent_id ,
sys_connect_by_path(id, '->') scbp from t connect by parent_id =
prior id and indicator = 'N' start with parent_id is null

Plan hash value: 2303479083

----------------------------------------------------------------------------------------------------------
| Id  | Operation                                | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                         |      |      1 |        |      6 |00:00:00.14 |     277 |
|*  1 |  CONNECT BY NO FILTERING WITH START-WITH |      |      1 |        |      6 |00:00:00.14 |     277 |
|   2 |   TABLE ACCESS FULL                      | T    |      1 |    100K|    100K|00:00:00.01 |     277 |
----------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("PARENT_ID"=PRIOR NULL)
       filter("PARENT_ID" IS NULL)

22 rows selected.
100K rows were read, and the A-Time was 0.14 seconds instead of 0.01 seconds. I wondered where those 0.14 seconds went, since the plan shows they were NOT spent on the full table scan. Using Tom Kyte's runstats_pkg reveals this:
SQL> declare
  2    cursor c1
  3    is
  4      select /*+ connect_by_filtering */ id
  5      ,      parent_id
  6      ,      sys_connect_by_path(id, '->') scbp
  7      from   t
  8      connect by parent_id = prior id
  9      and     indicator = 'N'
 10      start with parent_id is null
 11    ;
 12    cursor c2
 13    is
 14      select /*+ no_connect_by_filtering */ id
 15      ,      parent_id
 16      ,      sys_connect_by_path(id, '->') scbp
 17      from   t
 18      connect by parent_id = prior id
 19      and     indicator = 'N'
 20      start with parent_id is null
 21    ;
 22  begin
 23    runstats_pkg.rs_start;
 24    for r in c1 loop null; end loop;
 25    runstats_pkg.rs_middle;
 26    for r in c2 loop null; end loop;
 27    runstats_pkg.rs_stop;
 28  end;
 29  /
Run1 ran in 0 hsecs
Run2 ran in 10 hsecs
run 1 ran in 0% of the time

Name                                  Run1        Run2        Diff
STAT...HSC Heap Segment Block           16          15          -1
STAT...db block changes                 48          47          -1
STAT...consistent gets - exami           9           8          -1
STAT...db block gets from cach          32          33           1
STAT...db block gets                    32          33           1
STAT...redo subscn max counts            0           1           1
STAT...redo ordering marks               0           1           1
STAT...redo entries                     16          15          -1
STAT...calls to kcmgas                   0           1           1
STAT...calls to kcmgcs                  29          28          -1
STAT...free buffer requested             0           1           1
STAT...Heap Segment Array Inse          16          15          -1
STAT...consistent changes               32          31          -1
STAT...heap block compress               9           8          -1
STAT...parse time cpu                    1           0          -1
STAT...buffer is pinned count            1           0          -1
STAT...session cursor cache co           1           0          -1
STAT...sql area evicted                  1           0          -1
LATCH.undo global data                  11          10          -1
LATCH.SQL memory manager worka           3           5           2
LATCH.messages                           0           2           2
LATCH.OS process allocation              0           2           2
LATCH.simulator hash latch              20          23           3
LATCH.object queue header oper           4           1          -3
STAT...workarea executions - o          10           6          -4
STAT...table fetch by rowid             15          10          -5
STAT...index scans kdiixs1               6           0          -6
LATCH.row cache objects                280         274          -6
STAT...sorts (memory)                    8           2          -6
STAT...CPU used by this sessio           2          11           9
STAT...Elapsed Time                      1          11          10
STAT...recursive cpu usage               2          12          10
STAT...no work - consistent re         300         284         -16
STAT...buffer is not pinned co          36          20         -16
STAT...session logical reads           354         337         -17
STAT...consistent gets from ca         313         296         -17
STAT...consistent gets from ca         322         304         -18
LATCH.shared pool                      186         168         -18
STAT...consistent gets                 322         304         -18
LATCH.shared pool simulator             23           4         -19
LATCH.cache buffers chains             785         740         -45
STAT...undo change vector size       3,500       3,420         -80
STAT...redo size                     4,652       4,560         -92
STAT...session uga memory                0     -65,488     -65,488
STAT...session pga memory                0     -65,536     -65,536
STAT...sorts (rows)                     12     100,001      99,989

Run1 latches total versus runs -- difference and pct
       Run1        Run2        Diff       Pct
      1,467       1,384         -83     106.00%

PL/SQL procedure successfully completed.
The major difference is the number of rows sorted: CONNECT BY NO FILTERING WITH START-WITH sorts all 100K rows. This is a surprise, because a sort normally uses memory from the PGA workarea, which shows up in the memory statistics of the execution plan. But the no-filtering plan did not show those statistics (OMem, 1Mem, Used-Mem). I have no explanation for this phenomenon yet.
Let's zoom in on the sorting:
SQL> select sn.name
  2  ,      ms.value
  3  from   v$mystat ms
  4  ,      v$statname sn
  5  where  ms.statistic# = sn.statistic#
  6  and    sn.name like '%sort%'
  7  /

NAME                         VALUE
----------------------- ----------
sorts (memory)                2278
sorts (disk)                     0
sorts (rows)               9425510

3 rows selected.

SQL> select id
  2  ,      parent_id
  3  ,      sys_connect_by_path(id, '->') scbp
  4  from   t
  5  connect by parent_id = prior id
  6  and     indicator = 'N'
  7  start with parent_id is null
  8  /

        ID  PARENT_ID SCBP
---------- ---------- --------------------------------------------------
         0            ->0
         3          0 ->0->3
        33          3 ->0->3->33
       333         33 ->0->3->33->333
      3333        333 ->0->3->33->333->3333
     33333       3333 ->0->3->33->333->3333->33333

6 rows selected.

SQL> select sn.name
  2  ,      ms.value
  3  from   v$mystat ms
  4  ,      v$statname sn
  5  where  ms.statistic# = sn.statistic#
  6  and    sn.name like '%sort%'
  7  /

NAME                         VALUE
----------------------- ----------
sorts (memory)                2286
sorts (disk)                     0
sorts (rows)               9425522

3 rows selected.

SQL> select /*+ no_connect_by_filtering */ id
  2  ,      parent_id
  3  ,      sys_connect_by_path(id, '->') scbp
  4  from   t
  5  connect by parent_id = prior id
  6  and     indicator = 'N'
  7  start with parent_id is null
  8  /

        ID  PARENT_ID SCBP
---------- ---------- --------------------------------------------------
         0            ->0
         3          0 ->0->3
        33          3 ->0->3->33
       333         33 ->0->3->33->333
      3333        333 ->0->3->33->333->3333
     33333       3333 ->0->3->33->333->3333->33333

6 rows selected.

SQL> select sn.name
  2  ,      ms.value
  3  from   v$mystat ms
  4  ,      v$statname sn
  5  where  ms.statistic# = sn.statistic#
  6  and    sn.name like '%sort%'
  7  /

NAME                         VALUE
----------------------- ----------
sorts (memory)                2288
sorts (disk)                     0
sorts (rows)               9525523

3 rows selected.
So CONNECT BY WITH FILTERING did 8 sorts (2286 - 2278) and sorted 12 rows (9425522 - 9425510), whereas CONNECT BY NO FILTERING WITH START-WITH did 2 sorts (2288 - 2286) and sorted 100,001 rows (9525523 - 9425522).
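For clarity, the arithmetic on the three v$mystat snapshots above boils down to simple subtraction between consecutive snapshots:

```python
# Values of the two sort statistics at each of the three snapshots shown
# above: before the first query, between the queries, and after the second.
snap = {
    "before":  {"sorts (memory)": 2278, "sorts (rows)": 9425510},
    "between": {"sorts (memory)": 2286, "sorts (rows)": 9425522},
    "after":   {"sorts (memory)": 2288, "sorts (rows)": 9525523},
}

def delta(stat, start, end):
    return snap[end][stat] - snap[start][stat]

with_filtering_sorts = delta("sorts (memory)", "before", "between")  # 8 sorts
with_filtering_rows  = delta("sorts (rows)",   "before", "between")  # 12 rows
no_filtering_sorts   = delta("sorts (memory)", "between", "after")   # 2 sorts
no_filtering_rows    = delta("sorts (rows)",   "between", "after")   # 100,001 rows
```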
And finally, I promised to explain why the first two queries of this blogpost, although identical, show different execution plans. The reason is simple: the first was executed on 10.2.0.4 and the second on 11.2.0.2.