Converting to SQLite and LIKE Query Optimization

Monday, January 28, 2008

I recently converted our web application .geographical media to use SQLite as its primary database (relegating SqlExpress to the backup position).  This posed some interesting challenges but overall I was very pleased with the result.  As they say on their website sqlite  is small, fast and reliable.  For us, the major reason for change was to reduce load on our hosting servers enabling us to put many more websites on a single server. 

Our product is built with .NET and we used the System.Data.SQLite library built by Robert Simpson which seems to be extremely well written and was super easy to use.  Great job by Robert and thank you. 

One of the most interesting things I faced during the conversion was the experience I had with one of our more common queries.  We use a LIKE query with a wildcard at the end to enable us to search a hierarchical structure for inclusive location information.  We have a path column which contains data like the following:

geo-Africa-Gambia-

geo-Africa-Egypt-

geo-Africa-Egypt-Cairo

geo-Africa-Senegal

Now to query this table for all records in Egypt we use the query:

    SELECT * FROM Table WHERE Path LIKE 'geo-Africa-Egypt-%'

This has always been fine in our SqlExpress days as even when we have had to do complex queries against 10s of thousands of records the response time was always under 100ms.  So it was with shock and concern that after converting to SQLite I suddently found the same queries taking as much as 16 seconds to run on the same machine! 

Fortunately for me a man name Jim Lyon, way back in september of 2003, wrote a document which both explained and solved my dilemma.  His document can be found here  as is an invaluable document about optimizing your SQLite databases and queries.  The crux of the problem is this

 The GLOB and LIKE operators are expensive in SQLite because they can't make use of an index.

There is some reason for this but fortunately for me our LIKE query just has a wildcard at the end.  Jim explains all the details, but in the end it is possible to optimize this query to look like the following:

    SELECT * FROM Table WHERE Path >= 'geo-Africa-Egypt-' AND Path < 'geo-Africa-Egypt-zzz'

There MUST be an index on the Path column for this to work.  By adding the 'zzz' at the end the query will naturally select everything in the alphabetic sort order between the first and second query parameters. 

The end result was spectacular, my newly optimized queries were now running super fast, on both SQLite (~50ms) and even improved on SqlExpress (~60ms).  I have to give credit to Microsoft, as they must have some optimizations already built in for this type of query and its been making my life easy up until this point.     

I think making the switch to SQLite is a good on and can be extremely rewarding for most applications.  Overall, to get the performance I was looking for it made me a bit more honest of a programmer, I had to ensure all my queries were in proper transactions and indexes were explicitly stated.  Sometimes I found the documentation on the SQLite website a bit lacking.