Keyword Searches

by Joe Celko

Here is a short problem that you might like to play with. You are given a table with a document number and a keyword that someone extracted as descriptive of that document. This is the way that many professional organizations access journal articles. We can declare a simple version of this table.

CREATE TABLE Documents
(document_id INTEGER NOT NULL, 
 key_word VARCHAR(25) NOT NULL,
 PRIMARY KEY (document_id, key_word));

Your assignment is to write a general searching query in SQL. You are given a list of words that the document must have and a list of words which the document must NOT have.

We need a table for the list of words which we want to find:

CREATE TABLE SearchList
(word VARCHAR(25) NOT NULL PRIMARY KEY);

And we need another table for the words that will exclude a document.

CREATE TABLE ExcludeList
(word VARCHAR(25) NOT NULL PRIMARY KEY);

Breaking the problem down into two parts, excluding a document is easy.

CREATE TABLE ExcludeList
(word VARCHAR(25) NOT NULL PRIMARY KEY);

Breaking the problem down into two parts, excluding a document is easy.

 SELECT DISTINCT document_id
   FROM Documents AS D1
  WHERE NOT EXISTS
        (SELECT *
           FROM ExcludeList AS E1
          WHERE E1.word = D1.key_word);

This says that you want only the documents that have no matches in the excluded word list. You might want to make the WHERE clause in the subquery expression more general by using a LIKE predicate or similar expression, like this.

WHERE E1.word LIKE D1.key_word || '%'
   OR E1.word LIKE '%' || D1.key_word
   OR D1.key_word LIKE E1.word || '%'
   OR D1.key_word LIKE '%' || E1.word

This would give you a very forgiving matching criteria. That is not a good idea when you are excluding documents. When you wanted to get rid "Smith" is does not follow that you also wanted to get rid of "Smithsonian" as well.

For this example, Let you agree that equality is the right matching criteria, to keep the code simple.

Put that solution aside for a minute and move on to the other part of the problem; finding documents that have all the words you have in your search list.

The first attempt to combine both of these queries is:

 SELECT D1.document_id
   FROM Documents AS D1
  WHERE EXISTS
        (SELECT *
           FROM SearchList AS S1
          WHERE S1.word = D1.key_word);
    AND NOT EXISTS
        (SELECT *
           FROM ExcludeList AS E1
          WHERE E1.word = D1.key_word);

This answer is wrong. It will pick documents with any search word, not all search words. It does remove a document when it finds any of the exclude words. What do you do when a word is in both the search and the exclude lists? This predicate has made the decision that exclusion overrides the search list. The is probably reasonable, but it was not in the specifications. Another thing the specification did not tell us is what happens when a document has all the search words and some extras? Do we look only for an exact match, or can a document have more keywords?

Fortunately, the operation of picking the documents that contain all the search words is known as Relational Division. It was one of the original operators that Ted Codd proposed in his papers on relational database theory. Here is one way to code this operation in SQL.

 SELECT D1.document_id
   FROM Documents AS D1, SearchList AS S1
  WHERE D1.key_word = S1.word
  GROUP BY D1.document_id
 HAVING COUNT(D1.word)
        >= (SELECT COUNT(word) FROM SearchList);

What this does is map the search list to the document's key word list and if the search list is the same size as the mapping, you have a match. If you need a mental model of what is happening, imagine that a librarian is sticking Post-It notes on the documents that have each search word. When she has used all of the Post-It notes on one document, it is a match. If you want an exact match, change the >= to = in the HAVING clause.

Now we are ready to combine the two lists into one query. This will remove a document which contains any exclude word and accept a document with all (or more) of the search words.

 SELECT D1.document_id
   FROM Documents AS D1, SearchList AS S1
  WHERE D1.key_word = S1.word
    AND NOT EXISTS
        (SELECT *
           FROM ExcludeList AS E1
          WHERE E1.word = D1.key_word)
  GROUP BY D1.document_id
 HAVING COUNT(D1.word)
        >= (SELECT COUNT(word)
             FROM SearchList);

The trick is in seeing that there is an order of execution to the steps in process. If the exclude list is long, then this will filter out a lot of documents before doing the GROUP BY and the relational division.

Joe Celko was a member of the ANSI X3H2 Database Standards Committee and helped write the SQL-92 standards. He is the author of over 450 magazine columns and four books, the best known of which is SQL for Smarties (Morgan-Kaufmann Publishers, 1999). He is the Vice President of RDBMS at North Face Learning in Salt Lake City.

Contributors : Joe Celko
Last modified 2005-04-20 10:32 AM

DBAzine.com

Sections

Personal tools

Menu

Who Are You?

Keyword Searches