Fetch random rows faster with recycled IDs

Say you need to quickly fetch a random row from a database table. What’s your approach? If you answered:

SELECT * FROM tablename ORDER BY rand() LIMIT 1

…you’re probably in the majority, and in many cases right. However, the larger the table, the higher the performance penalty of this solution will be. That’s because you’re not really selecting a random row – you’re selecting all of them, sorting them by an arbitrary number and selecting the one that gets sorted as the first one.

How are you supposed to do it, then? There are many options, but so far no single one has stood out. One common approach is to work a random number against the table’s primary ID: generate a random number between the minimum and maximum IDs, fetch that row, and you should be good, right? Absolutely, but only if there are no gaps in your ID sequence. If you hit an ID that does not exist, you’ll have to either reroll or fall back to WHERE id >= $myrandomnumber. Imagine an exaggerated example of a gap of one million rows in your ID sequence and you’ll see where the problem is: the row right after the gap is a million times more likely to be selected than any other.
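To see that skew concretely, here’s a small sketch (the four-row table is made up for illustration) that counts, for every possible random draw, which row a WHERE id >= draw query would return:

```python
# Simulate the bias of "WHERE id >= draw" on a gappy ID sequence.
# Hypothetical tiny table: IDs 1, 2, 3 exist, then a gap, then ID 100.
ids = [1, 2, 3, 100]

# For every possible draw between MIN(id) and MAX(id), record which row
# "SELECT ... WHERE id >= draw ORDER BY id LIMIT 1" would return.
hits = {i: 0 for i in ids}
for draw in range(min(ids), max(ids) + 1):
    winner = min(i for i in ids if i >= draw)
    hits[winner] += 1

for i in ids:
    print(f"id {i}: selected by {hits[i]} of {max(ids)} draws")
```

The row after the gap (id 100) soaks up every draw that lands inside the gap, so it wins 97 of the 100 possible draws while each of the other rows wins exactly one.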

The solution I’ve come up with is adding a secondary indexed column called id2, which is actively kept as gapless as possible. A random number between MIN(id2) and MAX(id2) will then almost always hit an existing row, or at worst need a minimal number of rerolls. This way random rows can be selected lightning-fast, while maintaining an even and unbiased distribution.
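The selection side can be sketched like this. This is a Python/SQLite illustration of the reroll idea, not the code from this post; the table name, column name, and reroll cap are assumptions:

```python
import random
import sqlite3

def fetch_random_row(conn, table="test_data", max_rerolls=10):
    """Pick a random id2 between MIN(id2) and MAX(id2); reroll on a miss.

    Assumes `table` has an indexed, near-gapless id2 column as described
    in the text. Names here are illustrative.
    """
    lo, hi = conn.execute(f"SELECT MIN(id2), MAX(id2) FROM {table}").fetchone()
    if lo is None:
        return None  # empty table
    candidate = random.randint(lo, hi)
    for _ in range(max_rerolls):
        row = conn.execute(
            f"SELECT * FROM {table} WHERE id2 = ?", (candidate,)
        ).fetchone()
        if row is not None:
            return row
        candidate = random.randint(lo, hi)
    # Fall back to the >= trick only after repeated misses; with a
    # near-gapless sequence this path is rarely taken, so the bias it
    # introduces stays negligible.
    return conn.execute(
        f"SELECT * FROM {table} WHERE id2 >= ? ORDER BY id2 LIMIT 1",
        (candidate,),
    ).fetchone()
```

Each lookup is a single indexed point query, which is why this stays fast no matter how large the table grows.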

Now all we need is a way to keep the table gapless with minimal overhead. Constantly rebuilding the sequence every time a row is inserted or deleted will make both operations really slow, effectively negating the benefits. Instead, let’s set up a recycling bin for the id2s of deleted rows:

CREATE TABLE `recycle_ids` (
  `table_name` varchar(64) default NULL,
  `id2` int(11) default NULL,
  KEY `table_name` (`table_name`,`id2`)
) ENGINE=MyISAM

NOTE: Though I’m using MySQL and PHP in these examples, the solution itself is language agnostic.

Assuming your id2 column in your target table is already set up, you’re good to go! Whenever you delete a row, insert the deleted row’s id2 into the recycler:

// After deleting a row...
mysql_query("DELETE FROM test_data WHERE id2 = $random_row");
// ...we also add the id2 of the deleted row to the list of recycled ids
mysql_query("INSERT INTO recycle_ids (table_name, id2) VALUES ('test_data', $random_row)");

When inserting a new row, check the recycler first. If there are no id2s available, create a new one:

function _get_next_id($table = 'test_data') {
	// Check if a recycled ID is available
	$res = mysql_query("SELECT id2 FROM recycle_ids WHERE table_name = '$table' LIMIT 1");
	if(!$res) {
		die(mysql_error());
	}
	if(mysql_num_rows($res)==0) {
		// No recycled IDs available, let's use a new one
		$res = mysql_query("SELECT MAX(id2) as max_id FROM $table");
		$row = mysql_fetch_assoc($res);
		return $row['max_id']+1;
	}
	$row = mysql_fetch_assoc($res);
	// The ID has been recycled so we remove it from the list
	mysql_query("DELETE FROM recycle_ids WHERE table_name='$table' AND id2='{$row['id2']}'");
	return $row['id2'];
}
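For readers not on PHP, the same recycler lookup can be sketched in Python against SQLite (the original MySQL version above is the authoritative one for this post; note that, like the PHP original, the select-then-delete is not safe under concurrent writers without a transaction or lock):

```python
import sqlite3

def get_next_id2(conn, table="test_data"):
    """Python/SQLite sketch of the _get_next_id recycler lookup above."""
    row = conn.execute(
        "SELECT id2 FROM recycle_ids WHERE table_name = ? LIMIT 1", (table,)
    ).fetchone()
    if row is None:
        # No recycled IDs available, so extend the sequence.
        max_id = conn.execute(f"SELECT MAX(id2) FROM {table}").fetchone()[0]
        return (max_id or 0) + 1
    # Claim the recycled ID by removing it from the bin.
    conn.execute(
        "DELETE FROM recycle_ids WHERE table_name = ? AND id2 = ?",
        (table, row[0]),
    )
    return row[0]
```

With three rows (id2 1–3) and id2 2 recycled, the next call hands back 2; once the bin is empty again, the call after that extends the sequence to 4.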

That’s all there is to it! Well, in a real-world scenario we’d of course have better error handling and a smarter database interface than the basic MySQL client library (PDO would be my choice), but for this example I chose the one I think most people are familiar with.

Now for the most important question: What are the benchmarks? First of all, here’s the test script I used to insert the test data:

// The test loop. Add rows and every five iterations delete a random row.
for($i=1; $i<=$amount; $i++) {
	insert_row();
	if($i % 5 == 0)
		delete_random_row();
}

Running the loop 10,000 times against a table with 150,000 rows already in it takes 18.8 seconds on my low-spec virtual server. That’s roughly 0.0019 seconds per row, and that includes deleting a random row every 5 iterations.

How does random row selection compare to ORDER BY rand() for speed? Selecting a random row 100 times using the “traditional” method takes 75.5 seconds on my test table. Doing the same via id2 takes 0.03 seconds.

While this isn’t an all-powerful tool to tackle every situation, it’s definitely something you should try out if random queries are giving you performance woes. Whether it works for you or not, I’d be happy to hear from you!

