Tonight I continued dabbling with my little project and experimenting with spatial indexes in Spatialite. I quickly realized that while using indexes benefitted the queries, the questions were too easy and the queries were finishing very quickly regardless of using indexes or not (nice problem to have). Therefore, the benefits of using the indexes were being swamped out by little errors in timings.
So, I made a bigger dataset. As shown below, the number of features I’m testing against went from 183 to 2429. You’d think that would be enough, but stay tuned…
So, armed with this larger dataset I proceeded to test sensitivity to using indexes.
I dinked around with trying to write own my R-tree index queries (the “some index” examples below) and they helped. Then Sandro (author of spatialite) replied to a post I made seeking guidance. There he re-wrote my query (best examples below)and suggested I read about a new, easier way to use spatial indexes here http://www.gaia-gis.it/gaia-
The Original Query
So here is the original query.
SELECT HU_12_NAME FROM huc12 WHERE ST_Contains(Geometry, MakePoint(-70.250,43.802)) = 1
Faced with 2429 polygons, the original query takes about 1.9 seconds on average. My goal is to get results back from anywhere in the country in 1 second or less. Clearly this type of query isn’t going to cut it.
I’m also including the explain plans here. Obviously without an index the query has to do a full scan of the table to figure out which records it needs. Scans are bad. Scans of many records are very bad.
A Better Query
Ok, so to scan less you need to do some filtering based on something to reduce the records you’re talking to. In this case, we use the bboxes from the index tables to grab a more limited set of features to actually perform the more intensive ST_Contains test with just to make sure the point ACTUALLY falls within the set of polygons the BBOX suggests.
SELECT HU_12_NAME FROM huc12 WHERE ST_Contains(Geometry, MakePoint(-70.250,43.802)) AND ROWID IN ( SELECT pkid from idx_huc12_Geometry where xmin < -69 and xmax > -71)
OK, now I was getting ~1.1 seconds pretty reliably. Yep, that’s detectable change between the two queries. So, we have enough records to see the benefits of the improved queries. This was an improvement, but not enough.
The explain plan gets better. We see the filtering and while we’re scanning, it’s more manageable.
So then came Sandro’s reply. In it he guided me to this type of query.. which rocks.
SELECT HU_12_NAME FROM huc12 WHERE ST_Contains(Geometry, MakePoint(-70.250,43.802)) = 1 AND ROWID IN ( SELECT ROWID FROM SpatialIndex WHERE f_table_name = 'huc12' AND search_frame = MakePoint(-70.250,43.802));
And here we see true magic. The original query with no indexes ran for 1.9 seconds. Some index sped things up to 1.1. Now, hold on to your socks, Sandro’s more proper query comes in at 0.16 seconds. That kind of performance really gives me hope.
We don’t learn too much more from the explain plan. But hey, who needs to ask more questions of that kind of speed up.
That was fun. What does it do when stuck into PHP?
First let’s just do the same old “no index” version.
About 2.5 seconds on average. No surprises there.
How about with some spatial index.
About 1.8 seconds on average. Shaved quite a bit off, but we’re still way over the 1 second goal. How about with Sandro’s query?
Cool. We’re below the 1 second mark. I even saw a 0.7 second timing every so often. But, we’re very close to 1 second, so I’m starting to sweat a little. But wait, there’s more. I’ve still got those little filler queries happening on the page, and one of them is a bulky SELECT DISTINCT which is going to scan my larger dataset now. So, what happens when we get rid the crud?
Yes! At 0.26 seconds we’re way below 1 second again. Phew! What about the original query? What happens when we remove the cruft from that page but still use the un-optimized query?
1.9 seconds. Some improvement from the first page, but not where we need it. No worries, we’re moving on!