PHP or SQL?

TAGS:

PHP programmers sometimes ask whether it is better to code a certain database problem in PHP or SQL.  A recent practical example illustrates that SQL frequently is the best solution for speed of execution.

The problem was to scan a MySQL table to find duplicate rows.  The table, which we call the entity_tag table, has three columns: id, contact_id and tag_id.  The id column is the primary key and is guaranteed to be unique.  The other two columns can contain arbitrary integer values, and the schema does not prevent insertion of a row containing a (contact_id, tag_id) pair that duplicates a row already present in the table.

The first attempted solution was to read the table into memory and use PHP code to find duplicates.  On a table of about 17000 rows, this took 10 minutes of execution time using PHP 5.2 and MySQL 5.0.

This was replaced with a solution that did the comparison in SQL and returned only the primary keys of the duplicates:

SELECT t1.id, t2.id  FROM entity_tag AS t1, entity_tag AS t2 WHERE t1.contact_id = t2.contact_id AND t1.tag_id = t2.tag_id AND t1.id != t2.id;

On the same data and software, execution time was less than one second.

Using the PHP PDO extension's PDOStatement::fetch() method with the PDO::FETCH_ASSOC fetch style, the returned result for each row does not include the t1 or t2 qualifier as you might expect.   Instead the returned array has a single key, 'id', and a single value.  In order to retrieve both id values, a solution is to use the PDO::FETCH_NUM style.  This returns an array with two keys, 0 and 1, and the two id values.

The SQL shown above returns every permutation of duplicate rows, so every pair of duplicates is reported twice.  To report each duplicate pair only once, change the last comparison to t1.id < t2.id

Comments

Poor example

I don't think this is a very good example of splitting up the work load between a database or a web server. Doing a simple filter operation on a single table is virtually always best done on the database side. But to use this example as a foundation for saying that "frequently is the best solution for speed of execution." isn't a very strong argument.

I've had this discussion several times and what I've recommend to people is to look at your needs and requirements and go with the technique that works best for those conditions and not to worry about coming up with a hard and fast rule for every situation. If you are working on a project that will only ever have perhaps a dozen simultaneous users and very small/moderate growth patterns than going with a strongly normalized database and pushing some of the work on to the database is often a good thing to do. The flip side, if you are working on a site that has hundreds of thousands of simultaneous users and huge growth patterns then you'll be pushing more work out to the web servers where it is easier to scale and more interested in performance and scale of the database than in keeping things normalized.

In general neither approach is bad or good, better or worse, picking which one to use depends on what your needs and requirements are.

Joseph, We'd like to learn

Joseph,
We'd like to learn from your experience.  Please provide details of the site you worked on that had hundreds of thousands of users and huge growth patterns, where you increased performance by pushing work away from the database and out to the web servers.  Exactly what functions did you push out to the web servers, and how did that impact performance?  This will help us understand what conditions indicate the use of which technique.
-- Walt

  • Showing 1-2 of 2

Post new comment

Image CAPTCHA
Enter the characters shown in the image.