How we imported the Etymological lexicon of modern Breton from Wikisource into Wikidata lexicographical data

The Lexique étymologique du breton moderne (Etymological lexicon of modern Breton) is a dictionary of the Breton language, written by Victor Henry and published in 1900. Starting in summer 2021, Nicolas Vigneron and I imported the content of the dictionary from Wikisource into Wikidata lexicographical structured data. Before the import, there were 283 Breton lexemes (≈ words) in Wikidata; after the import, there were more than 4,000.

This post is divided into three sections: Project, Breton language, and Technical side.

Project

Victor Henry

Victor Henry was a French philologist and linguist. He wrote several books, including the Lexique étymologique du breton moderne (Etymological lexicon of modern Breton), published in 1900, which is a dictionary, in French, of the Breton language. As Victor Henry died in 1907, more than 70 years ago, his work is in the public domain. The book is available on Wikisource, in plain text formatted with wikicode. With Nicolas Vigneron, we imported the content of the dictionary into Wikidata lexicographical data, in a structured format.

Parsing

The first step was to parse the content of the dictionary to transform it into a machine-readable format. It was an iterative process, with parsing rules defined, implemented, and tested step by step. The results were regularly checked in order to:

To help check the data quality, several human-readable reports were created:

Import

Several lexemes were already present in Wikidata. They were edited before the import to avoid creating duplicates. In our case, we ensured that each lexeme had a statement described by source (P1343) with the value Lexique étymologique du breton moderne (Q19216625) and a qualifier stated as (P1932) with the headword of the corresponding notice in the dictionary (example).
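For illustration, here is a sketch in Python of how such lexemes can be listed through the Wikidata Query Service (the User-Agent string is just an example; the SPARQL prefixes dct:, p:, ps:, and pq: are predefined by the service):

import requests

# List the Breton lexemes that already cite the dictionary, together with
# the headword stored in the "stated as" qualifier.
QUERY = """
SELECT ?lexeme ?headword WHERE {
  ?lexeme dct:language wd:Q12107 ;    # Breton
          p:P1343 ?statement .        # described by source
  ?statement ps:P1343 wd:Q19216625 ;  # Lexique étymologique du breton moderne
             pq:P1932 ?headword .     # stated as
}
"""
rows = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "breton-import-check/0.1 (example)"},
).json()["results"]["bindings"]
for row in rows:
    print(row["lexeme"]["value"], row["headword"]["value"])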

When importing data into Wikidata with a bot, it is mandatory to request permission. It is a useful process for discussing the import with the community and gathering relevant feedback. In our case, the discussion took place on Wikidata and on the Telegram group dedicated to Wikidata lexicographical data. This step should be started in parallel with the parsing, so that the feedback is taken into account early in the development phase.

Once the parsed data was considered good enough, the first edits were made on Wikidata. This allowed us to fix a few bugs and to finish the permission process.

The main import took place at the beginning of January 2022, with more than 3,700 lexemes created.

Fixes and enhancements

Several things were deemed too difficult to import properly, so the imported data is not perfect. There are no obvious errors, but there is still work to do, notably on senses and etymology. This work is of course collaborative. For instance, you can work on lexemes from the import that still don’t have any sense, as sketched below.
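Here is a sketch, in Python, of how these lexemes could be listed through the Wikidata Query Service (same assumptions as in the sketch above):

import requests

# Find lexemes citing the dictionary that do not have any sense yet.
QUERY = """
SELECT ?lexeme WHERE {
  ?lexeme dct:language wd:Q12107 ;          # Breton
          p:P1343/ps:P1343 wd:Q19216625 .   # described by source: the dictionary
  FILTER NOT EXISTS { ?lexeme ontolex:sense ?sense . }
}
"""
rows = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "breton-senses-todo/0.1 (example)"},
).json()["results"]["bindings"]
print(len(rows), "lexemes still without senses")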

A first online workshop was organized in January 2022. It gathered 7 editors who were able to discover lexicographical data on Wikidata and to improve about 150 lexemes in Breton.

Tools

Several tools were used to facilitate the project, including:

Breton language

Examples of imported lexemes:

Implemented rules

Here is what was automatically imported from the dictionary:

Not implemented

Here is a list of things that were not automatically imported:

Technical side

The import was made using PHP and Python scripts, divided into three parts: crawling, parsing, and import.

Crawling

Crawling Wikisource was done with a PHP script; you can use any tool or scripting language to do the same.
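For illustration, here is a minimal sketch in Python (not the original PHP crawler) that fetches the raw wikicode of one page through the MediaWiki API of the French Wikisource; the page title below is hypothetical:

import requests

API = "https://fr.wikisource.org/w/api.php"
params = {
    "action": "query",
    "prop": "revisions",
    "rvprop": "content",
    "rvslots": "main",
    "titles": "Lexique étymologique du breton moderne",  # hypothetical page title
    "format": "json",
    "formatversion": "2",
}
page = requests.get(API, params=params).json()["query"]["pages"][0]
wikicode = page["revisions"][0]["slots"]["main"]["content"]
print(wikicode[:500])  # first 500 characters of the raw wikicode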

Parsing

Parsing was done with a Python script. It was initially written in PHP, then rewritten once we decided to use Python for the import step.
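As a toy illustration of this kind of rule (the notice format and the sample line are invented, not the actual layout of the dictionary), a first rule could extract the headword of a notice written in bold wikicode:

import re

# Hypothetical rule: a notice starts with its headword in bold wikicode.
NOTICE_RE = re.compile(r"^'''(?P<headword>[^']+)'''\s*(?P<body>.*)$")

def parse_notice(line: str):
    """Return (headword, body) for a notice line, or None if it doesn't match."""
    match = NOTICE_RE.match(line)
    if match is None:
        return None
    return match.group("headword"), match.group("body")

# Invented sample line, only to show the mechanism.
print(parse_notice("'''abad''' abbé, du latin abbatem."))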

Import

There are many ways to import data into Wikidata. In our case, we needed a tool able to create new lexemes. Here are some options that were considered:

I eventually decided to use Pywikibot for the main import. It had several advantages:

The drawback is that it does not support lexemes: it was necessary to build the JSON sent to the MediaWiki API “by hand”. In this regard, this post on Phabricator was really useful for understanding some subtleties, like how to create new forms with the keyword add (see the sketch below).
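Here is a minimal sketch of what such a creation can look like with Pywikibot (assuming a configured, logged-in bot account; the lemma and grammatical feature are examples, not taken from the actual import script):

import json
import pywikibot
from pywikibot.data import api

site = pywikibot.Site("wikidata", "wikidata")  # assumes user-config.py is set up

# JSON built "by hand": a new Breton lexeme with one form.
data = {
    "lemmas": {"br": {"language": "br", "value": "dour"}},  # example lemma
    "language": "Q12107",        # Breton
    "lexicalCategory": "Q1084",  # noun
    "forms": [
        {
            "add": "",  # the keyword telling the API to create a new form
            "representations": {"br": {"language": "br", "value": "dour"}},
            "grammaticalFeatures": ["Q110786"],  # singular
        }
    ],
}

request = api.Request(
    site=site,
    parameters={
        "action": "wbeditentity",
        "new": "lexeme",
        "data": json.dumps(data),
        "token": site.tokens["csrf"],
        "format": "json",
    },
)
print(request.submit())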

Some refinements were made using QuickStatements.

Source code

The source code is released under the CC0 license (public domain dedication). It has already been reused by another project!


Photo by Antoine Meyer, public domain.

Solving Wordle, Sutom, and Gerdle with SPARQL and Wikidata

Wordle

Wordle is a web game where the player has to find an English word. After each guess, the player is given clues like correctly placed letters, misplaced letters, or unused letters. Variants of the game exist in several languages, like Sutom (French) or Brezhle / Gerdle (Breton).

Wikidata is a collaborative knowledge base, including some lexicographical data — you can see it as a dictionary. It can be queried with the standard language SPARQL on the Wikidata Query Service.

Of course, there were several discussions on Twitter about how to solve these puzzles with SPARQL queries on Wikidata. Here, I present a general solution, inspired by those discussions, that uses SPARQL queries on Wikidata to reduce the number of guesses needed to find the correct word. It is followed by specific discussions for the French and Breton languages.

General solution (English)

First, we want to gather all available forms for a specific language in Wikidata. Having all forms is important in order to have every possible word, and not just the lemmas, which are usually only the singular forms for nouns or the infinitives for verbs.

Here is an example for English (Q1860):

SELECT DISTINCT ?form WHERE {
  [] dct:language wd:Q1860 ; ontolex:lexicalForm/ontolex:representation ?form .
}
ORDER BY ?form

Run the query ↗ (more than 100,000 results)

Length of the word

We want only words of a specific length. For instance 5 letters:

FILTER(STRLEN(?form) = 5)

Run the query ↗ (about 6,000 results)

Correctly placed letter

When we know the positions of some letters, we can apply a new filter with a regular expression.

Here, we are looking for a five-letter word with the letter r in the first position:

FILTER(REGEX(?form, "^r....$"))

The character ^ represents the start of the word and $ its end. A dot matches any letter.

Run the query ↗ (about 300 results)

Misplaced letter

When we know that a letter is present, but not at the correct position, we can apply two filters.

The first filter states that the letter is not at the specific position. Here, the letter s is not at the fourth position:

FILTER(REGEX(?form, "^...[^s].$"))

The second filter states that the letter is present at least once in the word. Here for the letter s:

FILTER(CONTAINS(?form, "s"))

Run the query ↗ (about 2,500 results)

Letter present at most once

Sometimes, we know that a letter is present only once. We can write the following rule to check that the letter e is not present several times:

FILTER(!REGEX(?form, "e.*e"))

Don’t forget to add the following rule to check that the letter is present at least once:

FILTER(CONTAINS(?form, "e"))

Run the query ↗ (about 45,000 results)

Letter not present

When we know that a letter is not present, we can filter out forms which contain it. Example for the letter a:

FILTER(!CONTAINS(?form, "a"))

Run the query ↗ (about 49,000 results)

Full example

Here is a full (but not final) example:

SELECT DISTINCT ?form WHERE {
  [] dct:language wd:Q1860 ; ontolex:lexicalForm/ontolex:representation ?form .
  FILTER(STRLEN(?form) = 5)
  FILTER(REGEX(?form, "^....e$"))
  FILTER(REGEX(?form, "^..[^e][^i].$"))
  FILTER(CONTAINS(?form, "c"))
  FILTER(CONTAINS(?form, "r"))
  FILTER(!REGEX(?form, "r.*r"))
  FILTER(!CONTAINS(?form, "o"))
}
ORDER BY ?form

Run the query ↗ (about 30 results)

In detail:

You may notice that several rules overlap. This is because the query is built step by step, after each guess. As the query runs quickly, merging rules to optimize it is of little interest here, except maybe for readability.

Ideas for improvement

Instead of sorting forms alphabetically, they should be sorted by the number of distinct letters they contain, in order to have a better chance of finding new clues at the next guess. Even better, the count should only include letters for which we don’t have rules yet.
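A possible sketch of this ranking, as a post-processing step in Python (the candidate words and the set of already-constrained letters are invented):

def score(word: str, ruled_letters: set[str]) -> int:
    """Count the distinct letters of the word not covered by any rule yet."""
    return len(set(word) - ruled_letters)

# Hypothetical query results and letters already covered by rules.
candidates = ["crepe", "curse", "crate"]
ruled = {"c", "r", "e"}

for word in sorted(candidates, key=lambda w: score(w, ruled), reverse=True):
    print(word, score(word, ruled))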

French

The same rules can be used in French. However, French has a lot of diacritics, like é or è, that make these rules hard to write, as you would have to list all possible combinations. A solution is to remove the diacritics before applying the rules to the forms:

[] dct:language wd:Q150 ; ontolex:lexicalForm/ontolex:representation ?f .
BIND(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(?f, "[àâä]", "a"), "[éèêë]", "e"), "[îï]", "i"), "[ôö]", "o"), "[ùûü]", "u") AS ?form) .

Run the query ↗ (about 24,000 results)

Breton

In Breton, the difficulty is that some letters, like ch and c’h, are made of several characters. An idea is to replace these letters with jokers, like ch = 0 and c’h = 1.

[] dct:language wd:Q12107 ; ontolex:lexicalForm/ontolex:representation ?f .
BIND(REPLACE(REPLACE(?f, "c'h", "1"), "ch", "0") AS ?form) .

You then have to use these jokers in the rules. For instance, a word that doesn’t contain the letter c’h:

FILTER(!CONTAINS(?form, "1"))

Run the query ↗ (about 4,700 results)

Candidacy for the Wikibase Community User Group

Here is my candidacy for group contact of the Wikibase Community User Group.

Presentation

I’m Envel Le Hir, a data architect working part time for a major IT company in France (work unrelated to Wikibase and the Wikimedia movement). I’ve been an active Wikimedian since 2015, with nearly 400K edits on Wikidata.

In the Wikidata community:

I’ve also been involved in the “meta” side of the Wikimedia movement. For instance, I helped solve the crisis that shook Wikimédia France in 2017, by starting the legal process to hold an early general assembly, by representing the chapter at Wikimania 2017, and by serving on the electoral committee during the most complicated general assembly in the history of the chapter.

I have no conflict of interest (to be completely transparent, I worked for a few months at a wiki-related startup, more than three years ago). I am a member of Wikimédia France, the French Wikimedia chapter, and of April, a French non-profit organization promoting free software.

Involvement in Wikibase and in the Wikibase Community User Group

I work (0.2 FTE) on a personal project with Wikibase at its core.

I write the Wikibase Yearly Summary series (2020, 2021), which gives an overview of what happens around Wikibase.

Specifically on the Wikibase Community User Group:

Plan as a group contact of the Wikibase Community User Group

Here are the topics I would like to work on as a group contact of the Wikibase Community User Group:

Some of these actions can be done without being a group contact and I hope to work on them, regardless of the outcome of the election.

Candidacy

This could have been my candidacy. However, I don’t think I’m a suitable candidate for this position and I will not run for it. I hope the chosen representatives will adopt some of my ideas.


Wikibase logo by H. Snater, CC BY-SA 3.0.