Ever checked out PostgreSQL’s full text search feature? It’s easy to use, and, together with Go, makes it possible to quickly build web apps like this one here – Scenes of Shakespeare – that can search through all of Shakespeare’s plays.
Check out the app now! Like it? Let’s see how to build it!
Storing the data
Each of Shakespeare’s 37 plays are divided into acts, and each act into scenes. We’ll store each scene as a row in the scenes table, which looks like this:
We want to perform full text searches over the “description” and the “body”
fields, and give a higher weightage to the “description” field. The way to
do this in Postgres is to add a column of type tsvector
.
As the contents of this column, we’ll store the weighted lexemes of
“description” and “body” columns.
Say what? A lexeme is a normalized representation of piece of English text, so that a query for “move” can match texts like “moving”, “moved” etc. Check this out:
to_tsvector is how you convert plain text to tsvector. See how all the three words end up as the same lexeme? There’s more about this stuff that we cannot cover here, but read this if you want to know more.
We also want to give a higher weightage to the scene description, relative to the
text of the scene, so that a match in the description is ranked higher than a match
in the text. Postgres supports four weights, called unimaginatively as A, B, C and D.
By default, A ranks higher than B, which ranks higher than C etc.
You add a weight to a tsvector
using the setweight
function, like this:
Now let’s put it all together. First let’s add the tsvector column, which we’ll call “tsv”:
Next, we’ll populate it with the weighted lexemes of the other columns. We’ll give a weight of ‘A’ to the description and ‘B’ to the body:
As a last step, we’ll create an index on the tsv
column. This will let Postgres
run our full text queries against the index rather than the table rows. Postgres
has different types of indexes,
and for our purpose we’ll use GIN (Generalized Inverted Index) – read this
if you want to know why.
The actual data
The site Open Source Shakespeare has downloadable databases of structured Shakespeare texts. While we can’t use this directly, here is a manipulated version of it, with the tables like we described above.
If you’re following along at home, be sure to download init.sql.gz and execute it on an empty database.
Querying the data
Now we’re all set to query the data. The query goes something like this:
We see a @@
operator, and an operand called q
. The @@ operator
is a boolean operator that returns true
when the right-hand-side value of type
tsquery
matches the left-hand-side value of type tsvector
. Here q
is a value
of type tsquery
.
A tsquery
, described here,
holds lexemes and operators that are used to match the lexemes against a tsvector. For e.g.,
a tsquery of 'heaven' & 'earth'
means that both the lexemes “heaven” and “earth”
should be present in the tsvector. There are other operators, including the “followed by”
operator that was just introduced in 9.6. However, we’ll not dive deeper into tsquery,
and will stick with a simple way of producing tsquery values from plain text:
or equivalently:
Here the plainto_tsquery
function
is used to create a tsquery value from simple text by joining it’s lexemes with the AND operator.
The query is perfectly valid. Run it now if you have the database ready. It should return 157 rows. However, it makes sense to return only the top 10 or so – but wait! hast not thou forgotten something? What about the ranking?
There are two rank functions – ts_rank
and ts_rank_cd
, described here
– that can quantify “how well does the tsquery match the tsvector?”. For example,
here are two lines that are matched against “smell sweet”, and although both
match, one matches “better”:
Let’s add this into our query, along with a limit of 10 rows:
There’s just one more thing left to do. As it stands, the query returns the whole of
the body column. What we really want are highlighted snippets from the body, just like
how a Google search returns snippets from a page with parts highlighted. For this,
use the ts_headline
function:
This will add HTML bold start/end tags (this is the default, you can change this) around matching text, and returns only snippets of relevant text:
However, just fetching ts_headline(body, q) instead of body has a gotcha: ts_headline will be called for each row in the table, before the WHERE clause is applied. And ts_headline is expensive. It works by lexemizing the whole text, matching it with the tsquery and then inserting the HTML tags. So instead, we’ll use a subquery:
The inner query will select 10 rows, each returning the full body, and the outer query will convert the body to ts_headline(body). This way the cost of ts_headline remains proportional to the search result limit of 10, rather than the number of rows in the table.
And that is our final query. (Well almost, since the actual code also has another table that maps “workid” to the full name of the play.)
Making a Web App
Let’s put together a Go-based web app to let people interactively search through the scene text. Thanks to Go’s built-in HTTP server and HTML templates, making a 3-page web app is rather trivial.
Connecting to the Postgres database is done via lib/pq, which provides a pure-Go database/sql driver for Postgres. This is the only dependency of our app.
The whole code, excluding the templates and the SQL queries is only about 120 LoC – read it right here. It has a couple of handler functions that display the web pages, rendered via templates. The error check is deliberately simple, and any errors result in a 404.
The app is deployable to Heroku, but remember to setup the database first. Here it is on Heroku, running on a free tier.
The code is available on Github here, and is MIT-licensed. Feel free to fork it and use it for all your text search needs!
Closing notes
It’s fairly easy to build a tool like this using Postgres and Go, and arguably simpler to run and maintain than a comparable Elasticsearch/Solr-based design.
PostgreSQL supports full text search in other languages too, like German, Italian,
Spanish etc. Use \dF
at the psql prompt to see the full list.
We also didn’t exploit all things that tsquery
is capable of. In 9.6, tsquery got
another operator (<N>
) that can be used to match phrases,
which is really much nicer than &
.
The PostgreSQL documentation about Full Text Search is the best place to start digging further.
Happy searching!