Using the union methods in database queries

From Joomla! Documentation

Revision as of 16:49, 8 February 2014 by Chris Davenport (talk | contribs) (First attempt at documenting the union database methods.)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)

Using a UNION in a database query is a useful way to combine the results of two or more database SELECT queries that are not necessarily linked by a database relationship. It can also be a useful performance optimisation. Using a UNION to combine the results of different queries can sometimes be significantly faster than a single query with a WHERE clause especially when the query involves joins to other large tables.

For those familiar with a little set theory a UNION does exactly what you would expect; it merges the set of results from one query with the set of results from another query to produce a set of results which is the set union of both of the individual result sets. If you want the set to be ordered then you will need to pay some particular attention to the way that is done, as will be explained later.

The basics[edit]

In order to use a UNION you need to be aware of the basic requirements established by the SQL server you are using. These are not enforced by Joomla, but if you don't comply with them you will get database errors. In particular, each SELECT query must return the same number of fields in the same order and with compatible data types in order for the UNION to be successful.

A simple example[edit]

Suppose you have want to send out a mailshot to a group of people but the database is such that the names and email addresses you want to send to are not all in the same table. To invent an arbitrary example, suppose that you want to send the mail to all customers and all suppliers and that that the names and email addresses are stored in tables called, unsurprisingly, customers and suppliers, respectively.

This query will retrieve all the customer information that we need for the mailshot:

$query
    ->select('name, email')
    ->from('customers')
    ;
$mailshot = $db->setQuery($query)->loadObjectList();

while this query will do the same for all suppliers:

$query
    ->select('name, email')
    ->from('suppliers')
    ;
$mailshot = $db->setQuery($query)->loadObjectList();

Then you can combine the results into a single query like this:

$query
    ->select('name, email')
    ->from('customers')
    ->union($q2->select('name , email')->from('suppliers'))
    ;
$mailshot = $db->setQuery($query)->loadObjectList();

The result set obtained from the union query will actually be a little different from running the individual queries separately because the union query will automatically eliminate duplicates. If you don't mind that the result set may contain duplicates (which mathematically speaking means it isn't a set) then using unionAll instead of union will improve performance.

Lots of ways to use union[edit]

The union (and unionAll) methods are quite flexible in what they will accept as arguments. You can pass an "old-style" string query, a JDatabaseQuery object, or an array of JDatabaseQuery objects. For example, suppose you have three tables, similar to the example above:

$q1->select('name, email')->from('customers');
$q2->select('name, email')->from('suppliers');
$q3->select('name, email')->from('shareholders');

Then all of these queries will produce the same results:

// The union method can be chained.
$q1->union($q2)->union($q3);

// The union method will accept string queries.
$q1->union($q2)->union('SELECT name, email FROM shareholders');

// The union method will accept an array of JDatabaseQuery objects.
$q1->union(array($q2, $q3));

// It doesn't matter which query object is at the "root" of the query.  In this case the actual query that is produced will be different but the result set will be the same.
$q2->union(array($q1, $q3));

union, unionAll and unionDistinct[edit]

There are actually three union methods available.

  • union produces a true set union of the individual result sets; that is, duplicates are removed. The process of eliminating duplicates may or may not incur a performance hit, depending on the data sets and database structures involved.
  • unionAll produces a union of the individual result sets but duplicates are not removed.
  • unionDistinct is identical in behaviour to union and is merely a proxy for the union method.

Using UNION instead of OR[edit]

There are some instances where using union can give a significant performance boost instead of the more commonly used alternative of an OR or an IN in a where clause.

For example, suppose you have a table of products and you want to extract just those products that belong to two particular categories. Typically this would be coded something like this:

$query
    ->select('*')
    ->from('products')
    ->where('category = ' . $db->q('catA'), 'or')
    ->where('category = ' . $db->q('catB'))
    ;
$products = $db->setQuery($query)->loadObjectList();

However, it is likely that you will see a useful increase in performance using union instead:

$query
    ->select('*')
    ->from('products')
    ->where('category = ' . $db->q('catA'))
    ;
$q2
    ->select('*')
    ->from('products')
    ->where('category = ' . $db->q('catB'))
    ;
$query->union($q2);
$products = $db->setQuery($query)->loadObjectList();

Of course, if you want to select products from more than just two categories, you can combine more individual queries together.

You should see a similar performance boost when replacing a where clause containing an IN statement.

Ordering results[edit]

If you want your results to be ordered then you need to be aware of how the database deals with ORDER BY clauses. The following comments apply to MySQL but probably apply to other databases too.

Suppose you want to output the names and email addresses in alphabetical order in the mailshot example above. Then you would simply do this:

$q2
    ->select('name , email')
    ->from('suppliers')
    ;
$query
    ->select('name, email')
    ->from('customers')
    ->union($q2)
    ->order('name')
    ;
$mailshot = $db->setQuery($query)->loadObjectList();

But suppose, for some reason, you wanted to output the names and email addresses in alphabetical order with all customers first, then all suppliers. This query will not give the expected result:

$q2
    ->select('name , email')
    ->from('suppliers')
    ->order('name')
    ;
$query
    ->select('name, email')
    ->from('customers')
    ->order('name')
    ->union($q2)
    ;
$mailshot = $db->setQuery($query)->loadObjectList();

This is because an ORDER BY clause in the individual SELECT statements implies nothing about the order in which the rows appear in the final result. A UNION produces an unordered set of rows. The query above will be syntactically correct, but the MySQL optimiser will simply ignore the ORDER BY clause on the suppliers SELECT statement and the ORDER BY clause on the customers SELECT statement will be applied to final result set rather than the individual result set.

The way around this is to add an additional column to the result set and sort on that in such a way that the results will have the desired ordering. Here's one way to do that:

$q2
    ->select('name , email, 1 as sort_col')
    ->from('suppliers')
    ;
$query
    ->select('name, email, 2 as sort_col')
    ->from('customers')
    ->union($q2)
    ->order('sort_col, name')
    ;
$mailshot = $db->setQuery($query)->loadObjectList();

Advanced ordering[edit]

However, there may be occasions when it is important to have an ORDER BY clause on the individual queries that won't be dropped by the query optimiser. Suppose you want to send a special offer to your top 10 customers and your top 5 suppliers. You will therefore want to apply a LIMIT clause in combination with the ORDER BY clause and in that case the query optimiser will not ignore the ordering. This is how you could do it:

$q2
    ->select('name , email, 1 as sort_col')
    ->from('suppliers')
    ->order('turnover DESC')
    ->setLimit(5)
    ;
$q1
    ->select('name, email, 2 as sort_col')
    ->from('customers')
    ->order('turnover DESC')
    ->setLimit(10)
    ;
$query
    ->select('name, email, 0 as sort_col')
    ->from('customers')
    ->where('1 = 0')
    ->union($q1)
    ->union($q2)
    ->order('sort_col, name')
    ;
$mailshot = $db->setQuery($query)->loadObjectList();

It's necessary to use a dummy query in this case because otherwise the order and setLimit methods would be applied to the final result set rather that the individual one.