Solving Wordle, Sutom, and Gerdle with SPARQL and Wikidata

by Envel Le Hir — CC BY-SA 4.0 — January 22, 2022 — SPARQL, Wikidata

Wordle

Wordle is a web game where the player has to find an English word. After each guess, the player is given clues like correctly placed letters, misplaced letters, or unused letters. Variants of the game exist in several languages, like Sutom (French) or Brezhle / Gerdle (Breton).

Wikidata is a collaborative knowledge base that includes some lexicographical data, so you can see it as a dictionary. It can be queried with SPARQL, a standard query language, on the Wikidata Query Service.

Several discussions on Twitter have explored how to solve these puzzles with SPARQL queries on Wikidata. Here, I present a general solution, inspired by those discussions, that reduces the number of guesses needed to find the correct word. It is followed by specific notes for the French and Breton languages.

General solution (English)

First, we want to gather all available forms for a specific language in Wikidata. Having all forms is important in order to have every possible word, and not just the lemmas, which are usually only the singular forms for nouns or the infinitives for verbs.

Here is an example for English (Q1860):

SELECT DISTINCT ?form WHERE {
  [] dct:language wd:Q1860 ; ontolex:lexicalForm/ontolex:representation ?form .
}
ORDER BY ?form

Run the query ↗ (more than 100,000 results)

Length of the word

We only want words of a specific length, for instance 5 letters:

FILTER(STRLEN(?form) = 5)

Run the query ↗ (about 6,000 results)
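
For reference, here is a sketch of the complete query with the length filter in place. It assumes that the prefixes wd:, dct: and ontolex: are predeclared, as they are on the Wikidata Query Service:

```sparql
SELECT DISTINCT ?form WHERE {
  [] dct:language wd:Q1860 ; ontolex:lexicalForm/ontolex:representation ?form .
  # Keep only forms that are exactly five characters long
  FILTER(STRLEN(?form) = 5)
}
ORDER BY ?form
```

All the filters in the following sections go in the same place, inside the WHERE block.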

Correctly placed letter

When we know the positions of some letters, we can apply a new filter with a regular expression.

Here, we are looking for a five-letter word with the letter r in the first position:

FILTER(REGEX(?form, "^r....$"))

The character ^ represents the start of the word and $ its end. Each dot matches any single character.

Run the query ↗ (about 300 results)
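
When several positions are known, they can go into the same pattern. The fragment below is only a sketch: the letters r and e are arbitrary examples, and the stricter variant assumes that the game only accepts lowercase unaccented letters, which may not hold for every variant:

```sparql
# Two known letters: r in first position and e in fifth position
FILTER(REGEX(?form, "^r...e$"))

# Stricter variant (assumption): the three unknown positions must be
# lowercase ASCII letters, which rules out hyphens, apostrophes and capitals
FILTER(REGEX(?form, "^r[a-z]{3}e$"))
```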

Misplaced letter

When we know that a letter is present, but not at the correct position, we can apply two filters.

The first filter states that the letter is not at the specific position. Here, the letter s is not at the fourth position:

FILTER(REGEX(?form, "^...[^s].$"))

The second filter states that the letter is present at least once in the word. Here for the letter s:

FILTER(CONTAINS(?form, "s"))

Run the query ↗ (about 2,500 results)
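
Several misplaced letters can be handled in the same way. As an illustration with arbitrary letters, assume that s is misplaced at the fourth position and t at the second position. Both exclusions fit in one pattern, and each letter still needs its own presence check:

```sparql
# t is not in second position and s is not in fourth position
FILTER(REGEX(?form, "^.[^t].[^s].$"))
# ... but both letters appear somewhere in the word
FILTER(CONTAINS(?form, "s"))
FILTER(CONTAINS(?form, "t"))
```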

Letter present at most once

Sometimes, we know that a letter is present only once. We can write the following rule to check that the letter e is not present several times:

FILTER(!REGEX(?form, "e.*e"))

Don’t forget to add the following rule to check that the letter is present at least once:

FILTER(CONTAINS(?form, "e"))

Run the query ↗ (about 45,000 results)
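
The same pattern can be turned around. If a guess containing the letter twice shows that both occurrences are in the word, the negation can be dropped to require at least two occurrences. This is a sketch of an extra rule that is not part of the queries above:

```sparql
# The letter e appears at least twice
FILTER(REGEX(?form, "e.*e"))
# ... but not three times or more
FILTER(!REGEX(?form, "e.*e.*e"))
```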

Letter not present

When we know that a letter is not present, we can filter out forms which contain it. Example for the letter a:

FILTER(!CONTAINS(?form, "a"))

Run the query ↗ (about 49,000 results)
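
After a few guesses, the list of absent letters grows. Instead of one filter per letter, a single character class can exclude all of them at once. The letters below are arbitrary examples:

```sparql
# None of the letters a, o or u appears in the word
FILTER(!REGEX(?form, "[aou]"))
```

This is equivalent to three separate !CONTAINS filters and is easier to extend after each guess.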

Full example

Here is a full (but not final) example:

SELECT DISTINCT ?form WHERE {
  [] dct:language wd:Q1860 ; ontolex:lexicalForm/ontolex:representation ?form .
  FILTER(STRLEN(?form) = 5)
  FILTER(REGEX(?form, "^....e$"))
  FILTER(REGEX(?form, "^..[^e][^i].$"))
  FILTER(CONTAINS(?form, "c"))
  FILTER(CONTAINS(?form, "r"))
  FILTER(!REGEX(?form, "r.*r"))
  FILTER(!CONTAINS(?form, "o"))
}
ORDER BY ?form

Run the query ↗ (about 30 results)

In detail:

line 2: word in English
line 3: word with exactly 5 letters
line 4: word with exactly 5 letters, ending with the letter e
line 5: word with exactly 5 letters, without the letter e at the third position and without the letter i at the fourth position
line 6: word containing the letter c at least once
line 7: word containing the letter r at least once
line 8: word containing the letter r at most once
line 9: word without the letter o

Note that several rules overlap. This is because the query is built step by step, one guess after another. As the query runs quickly, there is little to gain from merging the rules, except maybe for readability.
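
For readability, the overlapping rules can be merged. The following sketch should return the same forms as the full example: the first pattern already implies a length of five characters, so the STRLEN filter becomes redundant.

```sparql
SELECT DISTINCT ?form WHERE {
  [] dct:language wd:Q1860 ; ontolex:lexicalForm/ontolex:representation ?form .
  # Five characters, e in fifth position, no e in third, no i in fourth
  FILTER(REGEX(?form, "^..[^e][^i]e$"))
  # c and r are present
  FILTER(CONTAINS(?form, "c") && CONTAINS(?form, "r"))
  # r appears only once and o is absent
  FILTER(!REGEX(?form, "r.*r") && !CONTAINS(?form, "o"))
}
ORDER BY ?form
```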

Ideas for improvement

Instead of being sorted alphabetically, forms should be sorted by the number of distinct letters they contain, in order to have a better chance of finding new clues with the next guess. Even better, the count should only include letters for which we don’t have rules yet.
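
One possible way to sketch this idea is to join each form with a list of letters and count how many of them it contains. The variable names ?letter and ?distinctLetters are illustrative, and the approach assumes unaccented lowercase forms:

```sparql
SELECT ?form (COUNT(?letter) AS ?distinctLetters) WHERE {
  {
    # Candidate forms, with the rules gathered so far
    SELECT DISTINCT ?form WHERE {
      [] dct:language wd:Q1860 ; ontolex:lexicalForm/ontolex:representation ?form .
      FILTER(STRLEN(?form) = 5)
    }
  }
  # Letters to count; to follow the second idea, keep only the letters
  # for which there is no rule yet
  VALUES ?letter {
    "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m"
    "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
  }
  # One row per (form, letter) pair where the letter occurs in the form
  FILTER(CONTAINS(?form, ?letter))
}
GROUP BY ?form
ORDER BY DESC(?distinctLetters) ?form
```

Each letter is counted once per form, however many times it occurs. Forms that contain none of the listed letters are dropped from the results.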

French

The same rules can be used in French. However, French has a lot of diacritics, like é or è, that make these rules hard to write, as you would have to list all possible combinations. A solution is to remove the diacritics before applying the rules to the forms:

[] dct:language wd:Q150 ; ontolex:lexicalForm/ontolex:representation ?f .
BIND(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(?f, "[àâä]", "a"), "[éèêë]", "e"), "[îï]", "i"), "[ôö]", "o"), "[ùûü]", "u") AS ?form) .

Run the query ↗ (about 24,000 results)
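
Put together with the general rules, a complete French query could look like the sketch below. The intermediate variables ?f1 to ?f5 are illustrative names used to avoid deep nesting. The extra replacement of ç by c is an assumption about how the game treats this letter and is not part of the fragment above. The final pattern is an arbitrary example:

```sparql
SELECT DISTINCT ?form WHERE {
  [] dct:language wd:Q150 ; ontolex:lexicalForm/ontolex:representation ?f .
  # Remove diacritics step by step
  BIND(REPLACE(?f, "[àâä]", "a") AS ?f1)
  BIND(REPLACE(?f1, "[éèêë]", "e") AS ?f2)
  BIND(REPLACE(?f2, "[îï]", "i") AS ?f3)
  BIND(REPLACE(?f3, "[ôö]", "o") AS ?f4)
  BIND(REPLACE(?f4, "[ùûü]", "u") AS ?f5)
  # Assumption: the game treats ç as c
  BIND(REPLACE(?f5, "ç", "c") AS ?form)
  # The usual rules then apply to the normalized form
  FILTER(STRLEN(?form) = 5)
  FILTER(REGEX(?form, "^c....$"))
}
ORDER BY ?form
```

Because the query selects the normalized form, words that differ only by their diacritics are returned once.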

Breton

In Breton, the difficulty is that some letters, like ch and c’h, are made of several characters. An idea is to replace each of these letters with a single placeholder character, like ch = 0 and c’h = 1.

[] dct:language wd:Q12107 ; ontolex:lexicalForm/ontolex:representation ?f .
BIND(REPLACE(REPLACE(?f, "c'h", "1"), "ch", "0") AS ?form) .

You then have to use these placeholders in the rules. For instance, a word that doesn’t contain the letter c’h:

FILTER(!CONTAINS(?form, "1"))

Run the query ↗ (about 4,700 results)
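
A complete Breton query could be sketched as follows. The character class for the apostrophe is an assumption: depending on how a form was entered, the apostrophe of c’h may be a straight apostrophe, a typographic one, or a modifier letter. The position rule is an arbitrary example.

```sparql
SELECT DISTINCT ?f ?form WHERE {
  [] dct:language wd:Q12107 ; ontolex:lexicalForm/ontolex:representation ?f .
  # Assumption: the apostrophe in c'h may be stored as ', ’ or ʼ
  BIND(REPLACE(REPLACE(?f, "c['’ʼ]h", "1"), "ch", "0") AS ?form) .
  # The length is now counted in Breton letters, not in characters
  FILTER(STRLEN(?form) = 5)
  # No c'h anywhere in the word
  FILTER(!CONTAINS(?form, "1"))
  # ch in first position
  FILTER(REGEX(?form, "^0....$"))
}
ORDER BY ?form
```

The query returns both ?f, the original spelling, and ?form, the version with placeholders, so that the results stay readable.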