Lets use the same query we looked at when talking about nested-loop joins and see how that would be processed using a hash join.
SELECT * FROM emp, dept WHERE emp.deptno = dept.deptno
A hash join is different to a nested-loop join in that the tables are accessed independently and then rows from each table are matched using the join condition.
The steps taken to execute this query are as follows:-
- Firstly a full table scan is performed on dept (being the smaller table)
- The results of this table scan are placed in an in memory hash
- Then a full table scan is performed on dept
- As each row is processed the corresponding row is looked up in the hash of dept
So in pseudo-code this looks like:-
For Each deptrow in dept
Add deptrow to hash (keyed on dept.deptno)
End For
For Each emprow in emp
Look up deptrow from hash using emp.deptrow as key
Combine columns from emprow and deptrow
Return combined row
End For
Clearly this approach is not as robust as the the nested-loop join since it requires scratch memory (or disk, if the rowset is large enough) space. This could cause poor performance or possibly even error due to lack of space. Hash joins should only be considered if you are sure the smaller rowset will fit in memory
Ok, so if hash joins are less robust than nested-loop joins when should you consider using one?
- Hash joins are useful in cases where you have one large rowset joining to a small rowset. You then end up with the small rowset cached in RAM being joined to the large rowset.
- Hash joins can be better in situations where you are joining several unfiltered tables since the two smallest tables can be cached in RAM for fast keyed access.
Obviously you should test performance using a hash join and a nested-loop join before deciding which is best for your query.






0 Responses
Stay in touch with the conversation, subscribe to the RSS feed for comments on this post.