Sort-merge joins work in a similar way to hash joins in that each table is accessed independently. This time instead of using a hash the two rowsets are sorted on the join key. Once both rowsets are sorted they are merged on the join key using a merge join algorithm.
Sort-merge joins have the same advantages as hash joins but are far less robust since both rowsets need to be loaded into memory and sorted before the join can occur. You should only use this join method if a hash join is not available (Oracle databases using the Rules Based Optiomiser)
Uncategorized
/Databases
Lets use the same query we looked at when talking about nested-loop joins and see how that would be processed using a hash join.
SELECT * FROM emp, dept WHERE emp.deptno = dept.deptno
A hash join is different to a nested-loop join in that the tables are accessed independently and then rows from each table are matched using the join condition.
The steps taken to execute this query are as follows:-
- Firstly a full table scan is performed on dept (being the smaller table)
- The results of this table scan are placed in an in memory hash
- Then a full table scan is performed on dept
- As each row is processed the corresponding row is looked up in the hash of dept
So in pseudo-code this looks like:-
For Each deptrow in dept
Add deptrow to hash (keyed on dept.deptno)
End For
For Each emprow in emp
Look up deptrow from hash using emp.deptrow as key
Combine columns from emprow and deptrow
Return combined row
End For
Clearly this approach is not as robust as the the nested-loop join since it requires scratch memory (or disk, if the rowset is large enough) space. This could cause poor performance or possibly even error due to lack of space. Hash joins should only be considered if you are sure the smaller rowset will fit in memory
Ok, so if hash joins are less robust than nested-loop joins when should you consider using one?
- Hash joins are useful in cases where you have one large rowset joining to a small rowset. You then end up with the small rowset cached in RAM being joined to the large rowset.
- Hash joins can be better in situations where you are joining several unfiltered tables since the two smallest tables can be cached in RAM for fast keyed access.
Obviously you should test performance using a hash join and a nested-loop join before deciding which is best for your query.
Uncategorized
/Databases
Recent Comments