Posts Tagged sql server full-text search

SQL Server Full-Text Search

If you’re searching within text (string) data in SQL Server, the easiest method is to use the LIKE operator with the wildcard ‘%’ character, e.g.:

SELECT * FROM Customers WHERE FirstName LIKE 'Pete%'

will return every customer with the first name Pete or Peter, plus less common forenames like Peterjohn and Peterson. The search string (‘Pete%’) can contain multiple wildcards, and can be used to look for ranges or sets of characters. LIKE is quite powerful, although sadly not as powerful as a full-featured regular expression parser (something that Unix/Linux scripters will be very familiar with). More importantly, LIKE can’t always use column indexes that you think might be appropriate. In the above query, if FirstName was indexed, then that index would be used, but if the search string was e.g. ‘%illy’ (looking for ‘Milly’, ‘Gilly’, ‘Tilly’ or ‘Billy’), then it wouldn’t.

For a more high-powered approach to searching text within database tables, SQL Server has Full-Text Search. From that link:

Full-text queries perform linguistic searches against text data in full-text indexes by operating on words and phrases based on the rules of a particular language such as English or Japanese. Full-text queries can include simple words and phrases or multiple forms of a word or phrase.

I last used Full-Text Search a very long time ago — maybe in the days of SQL Server 7.0! — so I thought it was time to refresh my knowledge.

The dataset

I needed a reasonably large dataset for testing, so I downloaded the Free Company Data Product,
which is a “data snapshot containing basic company data of live companies on the register” at Companies House — basically, a CSV of every company currently incorporated here in the UK. The dataset I downloaded, dated 2024-02-07, was 452Mb zipped, 2.5Gb unzipped, and contains 5,499,110 rows and 55 columns.

After downloading and unzipping, I imported the CSV into SQL Server, then transferred it to a table with appropriate datatypes, also doing some tidying like removing NULL columns, and normalising out some categories to lookup tables. My main cleaned-up table is called cd.CompanyData (‘cd’ is the schema).

Not that it matters for what I’m doing here, but the data is pretty ropey in places; for example, the ‘previous name’ date fields unexpectedly switch from yyyy-mm-dd to dd/mm/yyyy. There are lots of duff postcodes, countries can have several slightly different names, and the address data is generally fairly low quality. For an important dataset from an official government source, I expected a lot better! If I was going to use this data in a professional setting, it’d take me a couple of days and a high-quality address cleaner to make it acceptable.

Creating the full-text index

I’m going to assume that you have the Full-Text Search component installed; it was literally a matter of ticking a checkbox on installation of SQL Server 2022 Developer edition. To check if it’s installed on your SQL Server instance, run:

SELECT FULLTEXTSERVICEPROPERTY('IsFullTextInstalled')

‘1’ means yes, ‘0’ means no.

Firstly, we have to create a Full-Text Catalog. Quote: “You have to create a full-text catalog before you can create a full-text index” — it’s merely a logical container for the full-text indexes we’ll create. In practice, it’s one line of SQL:

CREATE FULLTEXT CATALOG myFirstFullTextCatalog AS DEFAULT

Now we can create our first full-text index; we need three things: the table name (cd.CompanyData), the column we want to index (CompanyName) and the name of a unique key on that table (we can use our primary key, PK_cd_CompanyData):

CREATE FULLTEXT INDEX ON cd.CompanyData(CompanyName)
   KEY INDEX PK_cd_CompanyData
   WITH STOPLIST = SYSTEM

See CREATE FULLTEXT INDEX for more details. Note that we didn’t have to specify the catalog, as we’ve already declared a default. (Also, it doesn’t have to be a table we’re indexing, we can also add full-text indexes to indexed views.)

What does ‘STOPLIST = SYSTEM’ mean? The STOPLIST is a list of words that aren’t indexed, words like ‘and’, ‘but’, ‘in’, ‘is’. To see the SYSTEM list we’ve used here:

SELECT * FROM sys.fulltext_system_stopwords WHERE language_id = 1033

For me, it returns 154 rows, a list of words and some single characters. (NB: 1033 refers to English) You can curate your own STOPLISTs: Configure and Manage Stopwords and Stoplists for Full-Text Search

IMPORTANT: You can only add one full-text index to a table (or indexed view), but you can have multiple columns per index.

Using the index

You’ll notice that the CREATE FULLTEXT INDEX statement returns immediately; that doesn’t mean the index has been created that quickly, just that the indexing process has started. To keep an eye on the progress, I use the code from this stackoverflow answer. (For the index above, it took a couple of minutes on my old and severely under-powered Windows 10 PC; for multiple columns, it a took a bit longer.)

There are a number of ways to query the index: see Query with Full-Text Search, and Full-Text Search and Semantic Search Functions. Here, we’ll use the table function FREETEXTTABLE.

Let’s query the data, and look for all companies with the word ‘Tonkin’ in, order them by newest incorporated first:

;WITH cte_Search AS
(
	SELECT
			ft.[KEY]
			,ft.[RANK]
		FROM FREETEXTTABLE(cd.CompanyData, CompanyName, 'Tonkin') ft
)
SELECT
		c.CompanyNumber
		,c.CompanyName
		,c.IncorporationDate
		,ft.[RANK] 
	FROM cte_Search ft
	JOIN cd.CompanyData c
		ON c.CompanyNumber = ft.[KEY]
	ORDER BY c.IncorporationDate DESC

There are 30 results, first 16 shown here:

A recordset of companies with TONKIN in the name

The results are returned within a fraction of a second; the equivalent query using LIKE took nearly 10 seconds.

Aside: Why have I written the query using a CTE, instead of a simple JOIN? When researching this post, I came across this from Brent Ozar: Why Full Text’s CONTAINS Queries Are So Slow

Apart from the increase in speed, the full-text index includes related words, e.g. if I change the query to search for ‘machine’, I get 2153 results where the company names contain ‘MACHINE’, ‘MACHINES’ and ‘MACHINING’. If I search for ‘move’, I get 2497 results where the company names contain ‘MOVE’, ‘MOVING’, ‘MOVED’, ‘MOVES’ and ‘MOVERS’.

But what’s going on with the RANK column? Excellent question, why are they all the same number? It’s to do with the query string we’re passing to FREETEXTTABLE; if it was more complicated, e.g. ‘move OR clean’, then I get eight different values for RANK: 9,13,17,24,30,40,41,43. This page, Limit search results with RANK, has some interesting tidbits.

Loads more to do…

I’ve only lightly scratched the surface here; I know from experience that Full-Text Search is something to seriously consider for a production system, but I’d like a few days of playing around with real data, and real use-cases, before committing. If you’ve an on-going need to locate pieces of text in a large table (e.g. customer search functionality), there are a few things that can be done without installing Full-Text Search:

  • Ensuring current indexes are adequate, and actively being used by the SQL engine
  • Indexing the first few characters of a string
  • Splitting out words or other parts of strings into a separate table and indexing that
  • Caching the most popular searches
  • Storing or replicating the data to a Memory-Optimized Table
  • Hashing words, pairs/triples of words, or full phrases, and indexing the hashes
  • Enforcing other criteria to cut-down the search space, e.g. live customer? Applied this year?

and I’m sure I can come up with a few more, given time!

, , , ,

Leave a comment