https://www.crunchydata.com/blog/postgres-full-text-search-a-search-engine-in-a-database

Crunchy DataCrunchy Data
Products

Featured Products

Crunchy Bridge

Fully managed cloud Postgres available on AWS, Azure, & GCP.

Crunchy Postgres

Integrated high availability PostgreSQL solution for enterprises with
always on requirements.

Crunchy Postgres for Kubernetes

Crunchy Postgres for Kubernetes

Crunchy Hardened Postgres

Hardened, secure, access controlled PostgreSQL to meet advanced
security requirements.

All Crunchy Products

Crunchy Certified PostgreSQL
Crunchy Postgres for Cloud Foundry
Crunchy MLS Postgres
Crunchy Spatial
Cloud

Crunchy Bridge

Overview
Documentation
Pricing
Get Started

Cloud Partners

AWS
Azure
GCP
Heroku
Create your accountLogin
Contact us
Solutions

Industries

Government
Healthcare
Finance
SaaS
Automotive
Blockchain

Use Cases

Internal Database-as-a-Service
Embedded PostgreSQL
Edge Computing
PostGIS

Integrations

Ansible Tower
JFrog
Kong
OpenShift
Learning
Crunchy Data Developer PortalSoftware Documentation
Value of a Subscription
Customers
BlogContact UsDownload
Login
Access PortalCrunchy Bridge
Open menu
Postgres Tutorials

Postgres Full-Text Search: A Search Engine in a Database

Kat BatuigasKat Batuigas

Kat Batuigas

Jul 27, 2021*10 min read

Early in on my SQL journey, I thought that searching for a piece of
text in the database mostly involved querying like this:

SELECT col FROM table WHERE col LIKE '%some_value%';

Then I would throw in some wildcard operators or regular expressions
if I wanted to get more specific.

Later on, I worked with a client who wanted search functionality in
an app, so "LIKE" and regex weren't going to cut it. What I had known
all along was just pattern matching. It works perfectly fine for
certain purposes, but what happens when it's not just a matter of
checking for a straightforward pattern in a single text field?

For example, what if you wanted to search across multiple fields? How
about returning possible matches even if the search term happens to
be misspelled? Also, what if you have very large amounts of data to
search on? Sure, you can create indexes for columns on which you want
to query for pattern matches, but that will have limitations (for
instance, the B-tree index doesn't work for col LIKE '%substring%').

So when we say PostgreSQL is the "batteries included database," this
is just one reason why. With Postgres, you don't need to immediately
look farther than your own database management system for a full-text
search solution. If you haven't yet given Postgres' built-in
full-text search a try, read on for a simple intro.

Postgres Full-Text Search Basics for the Uninitiated

Core Postgres includes the following full-text search capabilities.
To name a few:

  * Ignore stop words (common words such as "the" or "an").
  * Stemming, where search matches can be based on a "root" form, or
    stem, of a word ("run" matches "runs" and "running" and even
    "ran").
  * Weight and rank search matches (so best matches can be sorted to
    the top of a result list).

Before we go further, let's also get ourselves familiarized with the
following concepts:

 1. A document is a set of data on which you want to carry out your
    full-text search. In Postgres, this could be built from a single
    column, or a combination of columns, even from multiple tables.
 2. The document is parsed into tokens, which are small fragments
    (e.g. words, phrases, etc) from the document's text. Tokens are
    then converted to more meaningful units of text called lexemes.
 3. In Postgres, this conversion is done with dictionaries -- there
    are built-in ones, but custom dictionaries can be created if
    necessary. These dictionaries help determine stop words that
    should get ignored, and whether differently-derived words have
    the same stem. Most dictionaries are for a specific language
    (English, German, etc) but you could also have ones that are for
    a specific domain.
 4. The sorted list of lexemes from the document is stored in the
    tsvector data type.

Example: Searching Storm Event Details

I have a table that contains storm events data gathered by the U.S.
National Weather Service. For simplicity's sake I won't include all
possible fields in the statement below, but there's a copy of the
data and some further information available in this repository.

CREATE TABLE se_details (
    episode_id int,
    event_id int primary key,
    state text,
    event_type text,
    begin_date_time timestamp,
    episode_narrative text,
    event_narrative text,
    ...
);

Let's also say that we want to carry out a full-text search on the
data on the event_narrative column. We could add a new column to the
table to store the preprocessed search document (i.e. the list of
lexemes):

ALTER TABLE se_details ADD COLUMN ts tsvector
    GENERATED ALWAYS AS (to_tsvector('english', event_narrative)) STORED;

ts is a generated column (new as of Postgres 12), and it's
automatically synced with the source data.

We can then create a GIN index on ts:

CREATE INDEX ts_idx ON se_details USING GIN (ts);

And then we can query like so:

SELECT state, begin_date_time, event_type, event_narrative
FROM se_details
WHERE ts @@ to_tsquery('english', 'tornado');

tsquery is the other full-text search data type in Postgres. It
represents search terms that have also been processed as lexemes, so
we'll pass in our input term to the to_tsquery function in order to
optimize our query for full-text search. (@@ is a match operator.)

What we get with this query are records where "tornado" is somewhere
in the text string, but in addition to that, here are a couple of
records in the result set where there are also matches for "tornado"
as lexeme ("tornado-like" and "tornadoes"):

state           | KENTUCKY
begin_date_time | 2018-04-03 18:08:00
event_type      | Thunderstorm Wind
event_narrative | A 1.5 mile wide swath of winds gusting to around 95 mph created **tornado-like** damage along Kentucky Highway 259 in Edmons
on County. The winds, extending 3/4 of a mile north and south of Bee Spring, destroyed or heavily damaged several small outbuildings, tore
part of the roof off of one home, uprooted and snapped the trunks of numerous trees, and snapped around a dozen power poles. Several othe
r homes sustained roof damage, and wind-driven hail shredded vinyl siding on a number of buildings.

state           | WISCONSIN
begin_date_time | 2018-08-28 15:30:00
event_type      | Thunderstorm Wind
event_narrative | A swath of widespread tree and crop damage across the southern portion of the county. Sections of trees and crops compl
etely flattened, and some structural damage from fallen trees or due to the strong downburst winds. Various roads closed due to fallen tre
es. Two semi-trucks were overturned on highway 57 in Waldo. The widespread wind damage and tornadoes caused structural damage to many home
s with 70 homes sustaining affected damage, 3 homes with minor damage, 2 homes with major damage, one home destroyed, and 2 businesses wit
h minor damage.

Searching for Phrases

One way to handle phrases as search terms is to use the & (AND) or
<-> (FOLLOWED BY) Boolean operators with the tsquery.

For example, if we want to search for the phrase "rain of debris":

SELECT state, begin_date_time, event_type, event_narrative
FROM se_details
WHERE ts @@ to_tsquery('english', **'rain & of & debris'**);

The search phrase gets normalized to 'rain' & 'debri'. The order
doesn't matter as long as both 'rain' and 'debri' have matches in the
document, such as this example:

    A debris flow caused by heavy rain on a saturated hillside
    blocked the Omak River Road one mile south of the intersection
    with State Route 97.

If we do to_tsquery('english', 'rain <-> of <-> debris') the tsquery
value is 'rain' <2> 'debri', meaning it will only match where 'rain'
is followed by 'debri' precisely two positions away, such as here:

    Heavy rain caused debris flows on the Coal Hollow Fire and Tank
    Hollow Fire burn scars.

(This was actually the only match, so using the <-> operator is a
little bit more restrictive.)

The phraseto_tsquery function can also parse the phrase itself, and
inserts between lexemes where N is the integer position of the next
lexeme when counting from the preceding one. This function doesn't
recognize operators unlike to_tsquery; for example, we can just pass
in the entire phrase like so:

phraseto_tsquery('english', 'rain of debris')

The tsquery value is 'rain' <2> 'debri' like above, so
phraseto_tsquery also accounts for positioning.

Functions for Weighting and Ranking Search Results

One very common use case for assigning different weights and ranking
is searching on articles. For example, you may want to merge the
article title and abstract or content together for search, but want
matches on title to be considered more relevant and thus rank higher.

Going back to our storm events example, our data table also has an
episode*narrative column in addition to eventnarrative. For storm
data, an _event is an individual type of storm event (e.g. flood,
hail), while an episode is an entire storm system and could contain
many different types of events.

Let's say we want to be able to carry out a full-text search on event
as well as episode narratives, but have decided that the event
narrative should weigh more than the episode narratives. We could
define the ts column like this instead:

ALTER TABLE se_details ADD COLUMN ts tsvector
    GENERATED ALWAYS AS
     **(setweight(to_tsvector('english', coalesce(event_narrative, '')), 'A') ||**
     **setweight(to_tsvector('english', coalesce(episode_narrative, '')), 'B'))** STORED;

setweight is a full-text function that assigns a weight to the
components of a document. The function takes the characters 'A', 'B',
'C', or 'D' (most weight to least, in that order). We're also using a
coalesce here so that the concatenation doesn't result in nulls if
either episode_narrative or event_narrative contain null values.

You could then use the ts_rank function in an ORDER BY clause to
return results from most relevant to less.

SELECT ...
**ORDER BY ts_rank(ts, to_tsquery('english', 'tornado')) DESC**;

So, this record is ranked higher in the search results:

state             | MISSISSIPPI
begin_date_time   | 2018-04-06 22:18:00
event_type        | Tornado
event_narrative   | This tornado touched down near the Jefferson Davis-Covington County line along Lucas Hollow Road. It continued southeast, crossing the
 county line. Some large limbs and trees were snapped and uprooted at this location. It then crossed Lucas Hollow Road again before crossing Leonard Road.
 A tornado debris signature was indicated on radar in these locations. The tornado uprooted and snapped many trees in this region. It also overturned a sm
all tractor trailer on Oakvale Road and caused some minor shingle damage to a home. After crossing Oakvale Road twice, the tornado lifted before crossing
Highway 35. The maximum winds in this tornado was 105mph and total path length was 2.91 miles. The maximum path width was 440 yards.
**episode_narrative |** A warm front was stretched across the region on April 6th. As a disturbance rode along this stalled front, it brought copious amounts
of rain to the region thanks to ample moisture in place. As daytime heating occurred, some storms developed which brought severe weather to the region.

Compared to this, where there is a match for "tornado" in
episode_narrative but not event_narrative:

state             | NEBRASKA
begin_date_time   | 2018-06-06 18:10:00
event_type        | Hail
event_narrative   | Hail predominately penny size with some quarter size hail mixed in.
episode_narrative |** Severe storms developed in the Nebraska Panhandle during the early evening hours of Jun
e 6th. As this activity tracked east, a broken line of strong to severe thunderstorms developed. Hail up to
 the size of ping pong balls, thunderstorm wind gusts to 70 MPH and a brief tornado touchdown were reported
. Heavy rain also fell leading to flash flooding in western Keith county.

Tip: ts_rank returns a floating-point value, so you could include the
expression in your SELECT to see how these matches score. In my case
I get around a 0.890 for the Mississippi event, and 0.243 for the
Nebraska event.

Yes, You Can Keep Full-Text Search in Postgres

You can get even deeper and make your Postgres full-text search even
more robust, by implementing features such as highlighting results,
or writing your own custom dictionaries or functions. You could also
look into enabling extensions such as unaccent (remove diacritic
signs from lexemes) or pg_trgm (for fuzzy search). Speaking of
extensions, those were just two of the extensions supported in
Crunchy Bridge. We've built our managed cloud Postgres service such
that you can dive right in and take advantage of all these Postgres
features.

With all that said: as you can see, you don't need a very involved
setup to get started. It's a good idea to try out whether you are
just beginning to explore a full-text search solution, or even just
reevaluating whether you need to go all out for a dedicated full-text
search service, especially if you already have Postgres in your
stack.

To be fair, Postgres doesn't have some search features that are
available with platforms such as Elasticsearch. But a major advantage
is that you won't have to maintain and sync a separate data store. If
you don't quite need search at super scale, there might be more for
you to gain by minimizing dependencies. Plus, the Postgres query
syntax that you already know with the addition of some new functions
and operators, can get you pretty far. Got any other questions or
thoughts about full-text search with Postgres? We're happy to hear
them on @crunchydata.

Enjoyed articles like this?

Get more delivered to your inbox

Do not fill this out please: [                    ]Do not fill this
out please, it will be pre-filled: [                    ]
Email address[                    ]
Join The List
Kat Batuigas

Written by

Kat Batuigas

July 27, 2021 *More by this author

Footer

Products

  * Crunchy Postgres
  * Crunchy Postgres for Kubernetes
  * Crunchy Bridge
  * Crunchy Certified PostgreSQL
  * Crunchy PostgreSQL for Cloud Foundry
  * Crunchy MLS PostgreSQL
  * Crunchy Spatial

Services & Support

  * Enterprise PostgreSQL Support
  * Ansible Tower
  * Red Hat Partner
  * Trusted PostgreSQL
  * Crunchy Data Subscription

Resources

  * Customer Portal
  * Software Documentation
  * Blog
  * Events

Company

  * About Crunchy Data
  * Team
  * News
  * Careers
  * Contact Us
  * Newsletter
  * Security

Crunchy Data Newsletter

Subscribe to the Crunchy Data Newsletter to receive Postgres content
every month.

Do not fill this out please: [                    ]Do not fill this
out please, it will be pre-filled: [                    ]
Email address[                    ]
Join The List
YouTubeLinkedInTwitterGitHub

(c) 2018-2022 Crunchy Data Solutions, Inc.

This site uses cookies for usage analytics to improve our service. By
continuing to browse this site, you agree to this use. Learn more

*