How we imported the Etymological lexicon of modern Breton from Wikisource into Wikidata lexicographical data
The Lexique étymologique du breton moderne (Etymological lexicon of modern Breton) is a dictionary of the Breton language, written by Victor Henry and published in 1900. Starting in the summer of 2021, Nicolas Vigneron and I imported the content of the dictionary from Wikisource into Wikidata as lexicographical structured data. Before the import, there were 283 Breton lexemes (≈ words) in Wikidata; after the import, there were more than 4,000.
This post is divided into three sections:
- Project: how we proceeded (this part requires no specific linguistic or technical knowledge);
- Breton: for people interested in the Breton language;
- Technical: for people interested in the technical side of the project.
Project
Victor Henry was a French philologist and linguist. He wrote several books, including the Lexique étymologique du breton moderne (Etymological lexicon of modern Breton), published in 1900, a dictionary of the Breton language written in French. As Victor Henry died in 1907, more than 70 years ago, his work is in the public domain. The book is available on Wikisource, as plain text formatted with wikicode. Nicolas Vigneron and I imported the content of the dictionary into Wikidata lexicographical data, in a structured format.
Parsing
The first step was to parse the content of the dictionary to transform it into a machine-readable format. It was an iterative process, with parsing rules defined, implemented, and tested step by step. The results were regularly checked in order to:
- validate the rules,
- improve the proofreading of the book on Wikisource. Hundreds of fixes were made to the book on Wikisource during this phase, so the project benefited both Wikidata and Wikisource.
To help check the data quality, several human-readable reports were created:
- obviously, the list of lexemes to be imported,
- a list of parsing errors (for instance, unrecognized lexical categories),
- letter frequencies (in our case, unigrams and bigrams), illustrated by the sketch below.
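As an illustration, here is a minimal Python sketch, not the project's actual code, of how unigram and bigram frequencies can be computed from a list of parsed words; unusual or rare bigrams often point at OCR or proofreading errors worth reviewing:

```python
from collections import Counter

def letter_frequencies(words):
    """Count unigrams (single letters) and bigrams (pairs of adjacent letters)."""
    unigrams = Counter()
    bigrams = Counter()
    for word in words:
        word = word.lower()
        unigrams.update(word)
        bigrams.update(word[i:i + 2] for i in range(len(word) - 1))
    return unigrams, bigrams

# Example with a few Breton words; rare bigrams are good candidates for review.
unigrams, bigrams = letter_frequencies(["bara", "gwin", "dour"])
print(unigrams.most_common(3))
print(bigrams.most_common(3))
```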
Import
Several lexemes were already present in Wikidata. They were edited before the import to avoid creating duplicates. In our case, we ensured that each of these lexemes had a statement described by source (P1343) with the value Lexique étymologique du breton moderne (Q19216625) and a qualifier stated as (P1932) with the headword of the corresponding entry in the dictionary (example).
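To illustrate, here is a hedged Python sketch of how such lexemes can be listed with the Wikidata Query Service; the property paths follow the standard Wikidata RDF model, but the exact query used during the project may have been different:

```python
import requests

# Breton lexemes (language Q12107) that cite the dictionary (Q19216625)
# through "described by source" (P1343), with the headword recorded in
# the "stated as" (P1932) qualifier.
QUERY = """
SELECT ?lexeme ?headword WHERE {
  ?lexeme dct:language wd:Q12107 ;
          p:P1343 ?statement .
  ?statement ps:P1343 wd:Q19216625 ;
             pq:P1932 ?headword .
}
"""

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "lexicon-import-example/0.1 (example)"},
)
response.raise_for_status()
for row in response.json()["results"]["bindings"]:
    print(row["lexeme"]["value"], row["headword"]["value"])
```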
When importing data into Wikidata with a bot, it is mandatory to request permission. It is a useful process to discuss the import with the community and to gather relevant feedback. In our case, the discussion took place on Wikidata and on the Telegram group dedicated to Wikidata lexicographical data. This step should be started in parallel with the parsing, so the feedback can be taken into account early in the development phase.
Once the quality of the parsed data was considered good enough, the first edits were made on Wikidata. This allowed us to fix a few bugs and to complete the permission process.
The main import took place at the beginning of January 2022, with more than 3,700 lexemes created.
Fixes and enhancements
Some parts of the dictionary were judged too difficult to import properly, so the imported data is not complete. There are no obvious errors, but there is still work to do, such as senses and etymology. This work is of course collaborative: for instance, you can work on lexemes from the import that still don’t have any sense.
A first online workshop was organized in January 2022. It gathered 7 editors who were able to discover lexicographical data on Wikidata and to improve about 150 lexemes in Breton.
Tools
Several tools were used to facilitate the project, including:
- Etherpad, to have a single place to take notes.
- Wikimédia France’s instance of BigBlueButton, for meetings.
Breton language
Examples of imported lexemes:
Implemented rules
Here is what was automatically imported from the dictionary (a sketch of the kind of record produced by the parser follows the list):
- To avoid mismatches, all apostrophes are converted to vertical apostrophes.
- Each lexeme has:
  - a language (Breton).
  - a lemma, the first value of the headword (values are separated by commas).
  - a lexical category (noun, verb, etc. full list).
  - a reference to the Lexique étymologique du breton moderne, with:
    - the page number of the entry,
    - a link to the entry on Wikisource,
    - the headword of the entry,
    - the list of the forms.
  - for nouns: a grammatical gender (feminine or masculine), depending on their lexical category.
  - for verbs: a conjugation class (mostly regular Breton conjugation).
  - forms, from the headword, and where applicable:
    - a dialect (Cornouaille, Leon, Tregor, Vannes),
    - grammatical features:
      - for adjectives: positive,
      - for nouns: number (singular or plural), depending on their lexical category,
      - for verbs: infinitive.
- Lexemes whose lemma starts with a star are instances of reconstructed word.
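To make the list above more concrete, here is a hedged sketch of the kind of intermediate record the parser could produce for one entry; the field names and sample values are illustrative, not the actual output of parser.py:

```python
# Hypothetical intermediate record for one dictionary entry.
# Field names and values are illustrative only.
entry = {
    "headword": "bara",               # headword of the entry in the dictionary
    "page": 37,                       # page number in the book (illustrative)
    "lemma": "bara",                  # first value of the headword
    "lexical_category": "noun",
    "grammatical_gender": "masculine",
    "reconstructed": False,           # True when the lemma starts with a star
    "forms": [
        {
            "representation": "bara",
            "dialect": None,          # e.g. "Vannes" when the dictionary gives one
            "grammatical_features": ["singular"],
        },
    ],
}
```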
Not implemented
Here is a list of things that were not automatically imported:
- Variants are not merged. For instance, krouilh had to be manually merged into one lexeme after the import, because it appears twice in the dictionary, as Kourouḷ and as Krouḷ.
- Senses are not created.
- Etymology is not filled.
Technical side
The import was made using PHP and Python scripts, divided into three parts:
- crawler.php: crawls the text from Wikisource and removes the text that is useless in our case.
- parser.py: parses the crawled text, transforms it into the Wikibase format, and generates some reports.
- bot.py: imports the generated data into Wikidata.
Crawling
Crawling Wikisource was done with a PHP script. You can use any tool or scripting language to do the same.
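As an illustration, here is a minimal Python sketch (the original crawler is a PHP script) that fetches the wikitext of a page through the MediaWiki API of the French Wikisource; the page title is only an example, and the original crawler certainly did more cleaning:

```python
import requests

API = "https://fr.wikisource.org/w/api.php"

def fetch_wikitext(title):
    """Fetch the raw wikitext of a Wikisource page through the MediaWiki API."""
    response = requests.get(API, params={
        "action": "parse",
        "page": title,
        "prop": "wikitext",
        "format": "json",
        "formatversion": "2",
    })
    response.raise_for_status()
    return response.json()["parse"]["wikitext"]

# The page title below is only an example.
text = fetch_wikitext("Lexique étymologique du breton moderne")
print(text[:500])
```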
Parsing
Parsing was done with a Python script. It was initially written in PHP, but was rewritten in Python when we decided to use Python for the import step.
Import
There are many ways to import data into Wikidata. In our case, we needed a tool able to create new lexemes. Here are some options that were considered:
- QuickStatements: can update existing lexemes, but not create new ones (T220985).
- OpenRefine: does not support lexemes (#2240).
- Wikidata Toolkit: supports lexemes (#437), but probably not the easiest tool to import data into Wikidata.
- Pywikibot: does not support lexemes (T189321), but can interact with the MediaWiki API.
- WikidataIntegrator: lacked examples when it was reviewed.
- WikibaseIntegrator: fork of WikidataIntegrator, rewrite in progress.
  - Example of usage: LexUtils.
- LexData: not updated for several years, multiple forks.
I eventually decided to use Pywikibot for the main import. It had several advantages:
- Well-maintained tool, well-known in the Wikimedia community.
- Easy calls to the MediaWiki API, with the method _simple_request (even though it is supposed to be private).
- Supports bot passwords.
- Supports maxlag.
The drawback is that it does not support lexemes: it was necessary to create “by hand” the JSON sent to the MediaWiki API. In this regard, this post on Phabricator was really useful to understand some subtleties, like how to create new forms with the keyword add.
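To give an idea, here is a hedged sketch of such a call with Pywikibot; the lexeme data is reduced to a minimum, _simple_request is a private method whose name may change between Pywikibot versions, and the actual bot.py certainly builds a richer JSON (reference, gender, dialects, etc.):

```python
import json
import pywikibot

site = pywikibot.Site("wikidata", "wikidata")
site.login()  # typically with a bot password

# Minimal lexeme JSON, built "by hand": a Breton (Q12107) noun (Q1084)
# with a single form; the "add" keyword tells the API to create the form.
data = {
    "type": "lexeme",
    "language": "Q12107",
    "lexicalCategory": "Q1084",
    "lemmas": {"br": {"language": "br", "value": "bara"}},
    "forms": [{
        "add": "",
        "representations": {"br": {"language": "br", "value": "bara"}},
        "grammaticalFeatures": [],
        "claims": {},
    }],
}

# Call the wbeditentity module of the MediaWiki API directly.
# Note: running this would create a real lexeme on Wikidata.
request = site._simple_request(
    action="wbeditentity",
    new="lexeme",
    data=json.dumps(data),
    token=site.tokens["csrf"],
    summary="Example import edit",
    bot=True,
)
result = request.submit()
print(result["entity"]["id"])
```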
Some refinements were made using QuickStatements.
Source code
The source code is released under CC0 license (public domain dedication). It has already been reused by another project!
Photo by Antoine Meyer, public domain.
Solving Wordle, Sutom, and Gerdle with SPARQL and Wikidata
Wordle is a web game where the player has to find an English word. After each guess, the player is given clues like correctly placed letters, misplaced letters, or unused letters. Variants of the game exist in several languages, like Sutom (French) or Brezhle / Gerdle (Breton).
Wikidata is a collaborative knowledge base, including some lexicographical data — you can see it as a dictionary. It can be queried with the standard language SPARQL on the Wikidata Query Service.
Of course, there were several discussions on Twitter about how to solve these puzzles with SPARQL queries on Wikidata. Here, I present a general solution, inspired by those discussions, to reduce the number of guesses needed to find the correct word, using SPARQL queries on Wikidata. It is followed by specific sections for the French and Breton languages.
General solution (English)
First, we want to gather all available forms for a specific language in Wikidata. Having all forms is important in order to have every possible word, and not just the lemmas, which are usually only the singular forms for nouns or the infinitives for verbs.
Here is an example for English (Q1860):
SELECT DISTINCT ?form WHERE {
  [] dct:language wd:Q1860 ;
     ontolex:lexicalForm/ontolex:representation ?form .
} ORDER BY ?form
Run the query ↗ (more than 100,000 results)
Length of the word
We want only words of a specific length. For instance 5 letters:
FILTER(STRLEN(?form) = 5)
Run the query ↗ (about 6,000 results)
Correctly placed letter
When we know the positions of some letters, we can apply a new filter with a regular expression.
Here, we are looking for a five-letter word with the letter r in the first position:
FILTER(REGEX(?form, "^r....$"))
The character ^ represents the start of the word and $ its end. The dot can be any letter.
Run the query ↗ (about 300 results)
Misplaced letter
When we know that a letter is present, but not at the correct position, we can apply two filters.
The first filter states that the letter is not at the specific position. Here, the letter s is not at the fourth position:
FILTER(REGEX(?form, "^...[^s].$"))
The second filter states that the letter is present at least once in the word. Here for the letter s:
FILTER(CONTAINS(?form, "s"))
Run the query ↗ (about 2,500 results)
Letter present at most once
Sometimes, we know that a letter is present only once. We can write the following rule to check that the letter e is not present several times:
FILTER(!REGEX(?form, "e.*e"))
Don’t forget to add the following rule to check that the letter is present at least once:
FILTER(CONTAINS(?form, "e"))
Run the query ↗ (about 45,000 results)
Letter not present
When we know that a letter is not present, we can filter out forms which contain it. Example for the letter a:
FILTER(!CONTAINS(?form, "a"))
Run the query ↗ (about 49,000 results)
Full example
Here is a full (but not final) example:
SELECT DISTINCT ?form WHERE {
  [] dct:language wd:Q1860 ; ontolex:lexicalForm/ontolex:representation ?form .
  FILTER(STRLEN(?form) = 5)
  FILTER(REGEX(?form, "^....e$"))
  FILTER(REGEX(?form, "^..[^e][^i].$"))
  FILTER(CONTAINS(?form, "c"))
  FILTER(CONTAINS(?form, "r"))
  FILTER(!REGEX(?form, "r.*r"))
  FILTER(!CONTAINS(?form, "o"))
} ORDER BY ?form
Run the query ↗ (about 30 results)
In detail:
- line 2: word in English
- line 3: word with exactly 5 letters
- line 4: word with exactly 5 letters and ending with the letter e
- line 5: word with exactly 5 letters, without the letter e at the third position, and without the letter i at the fourth position
- line 6: word containing the letter c at least once
- line 7: word containing the letter r at least once
- line 8: word containing the letter r at most once
- line 9: word without the letter o
You may notice that several rules overlap. This is because the query is built step by step, after each guess. As the query runs quickly, merging rules to optimize it is of little interest here, except maybe for readability.
Ideas for improvement
Instead of sorting forms alphabetically, forms should be sorted by the number of distinct letters they contain, in order to have a better chance of finding new clues at the next guess. Even better, the count should only include letters for which we don’t have rules yet.
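This sorting does not have to be done in SPARQL itself; a small post-processing step can rank the candidate forms returned by the query. Here is a hedged Python sketch, where known_letters stands for the letters already constrained by previous guesses:

```python
def rank_candidates(forms, known_letters=frozenset()):
    """Sort candidate words so that those bringing the most new information
    (distinct letters not yet constrained by a rule) come first."""
    def score(form):
        return len(set(form) - set(known_letters))
    return sorted(forms, key=score, reverse=True)

# With c and r already constrained, "crate" still brings 3 new letters,
# while "eerie" and "creme" bring only 2.
print(rank_candidates(["eerie", "crate", "creme"], known_letters={"c", "r"}))
```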
French
The same rules can be used for French. However, French has a lot of diacritics, like é or è, which make these rules hard to write, as you would have to list all possible combinations. A solution is to remove the diacritics before applying the rules to the forms:
[] dct:language wd:Q150 ;
   ontolex:lexicalForm/ontolex:representation ?f .
BIND(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(?f, "[àâä]", "a"), "[éèêë]", "e"), "[îï]", "i"), "[ôö]", "o"), "[ùûü]", "u") AS ?form) .
Run the query ↗ (about 24,000 results)
Breton
In Breton, the difficulty is that some letters, like ch and c’h, are made of several characters. An idea is to replace these letters with jokers, for instance ch = 0 and c’h = 1.
[] dct:language wd:Q12107 ;
   ontolex:lexicalForm/ontolex:representation ?f .
BIND(REPLACE(REPLACE(?f, "c'h", "1"), "ch", "0") AS ?form) .
You then have to use these jokers in the rules. For instance, a word that doesn’t contain the letter c’h:
FILTER(!CONTAINS(?form, "1"))
Run the query ↗ (about 4,700 results)
Candidacy for the Wikibase Community User Group
Here is my candidacy for group contact of the Wikibase Community User Group.
Presentation
I’m Envel Le Hir, a data architect working part-time for a major IT company in France (work unrelated to Wikibase and the Wikimedia movement). I have been an active Wikimedian since 2015, with nearly 400K edits on Wikidata.
In the Wikidata community:
- I have contributed mostly to biographies (French politicians, the French National Library, family names, etc.).
- I developed Denelezh, a tool that provides statistics about the gender gap in the content of Wikimedia projects. Its replacement by Humaniki, the result of a collaboration between several Wikimedians on this topic, is in progress.
- I organized and took part in several presentations and workshops about Wikidata, from my local group to international conferences.
- I built the ice breaker of WikidataCon 2019 🙂
I’ve also been involved in the “meta” side of the Wikimedia movement. For instance, I helped solve the crisis that shook Wikimédia France in 2017, by starting the legal process to hold an early general assembly, by representing the chapter at Wikimania 2017, and by being a member of the electoral committee during the most complicated assembly in the history of the chapter.
I have no conflict of interest (to be completely transparent, I worked for a few months at a wiki-related startup more than three years ago). I am a member of Wikimédia France, the French Wikimedia chapter, and of April, a French non-profit organization promoting free software.
Involvement in Wikibase and in the Wikibase Community User Group
I work (0.2 FTE) on a personal project using Wikibase at its core.
I write the Wikibase Yearly Summary series (2020, 2021), which gives an overview of what happens around Wikibase.
Specifically on the Wikibase Community User Group:
- I organized two online meetings of the user group (2020-02-20 and 2021-06-09).
- I started the 2019 annual report of the user group, wrote the largest part of it, involved the community in its writing, and submitted it.
- I started and contributed to the 2020 and 2021 annual reports of the user group.
- I improved the structure of the pages about the user group on Meta (automatic archival of discussions, navigation box, etc.).
- I’m involved in the main communication channels used by the user group.
Plan as a group contact of the Wikibase Community User Group
Here are the topics I would like to work on as a group contact of the Wikibase Community User Group:
- Clarify the roles of the two affiliates, the Wikibase Community User Group and Wikimedia Deutschland.
  - The current position of Wikimedia Deutschland is confusing: they state that they don’t want to be involved in the organization of the user group, yet they organize meetings in the name of the user group. In my opinion, this leads to less engagement of the Wikibase community in the organization of the user group (for instance, people are less likely to write reports about events they didn’t organize). To my knowledge, the Wikimedia Foundation informally advised Wikimedia Deutschland to stop using the name of the Wikibase Community User Group, and Wikimedia Deutschland chose to ignore this advice.
  - A solution could be to sign a formal agreement between the user group and Wikimedia Deutschland. The goal is to have a real and mutual collaboration between the two affiliates, not a hierarchical relationship.
- Organize meetings of the user group in a more inclusive and collaborative way.
  - Open community decision for the recurring schedule (the choice of the WLS schedule is an excellent example of survivorship bias).
  - Open community agenda, not something:
    - made up by a single person who states they don’t want to be involved in the organization of the user group,
    - announced three days in advance.
- Free the mailing list.
  - Take back control of the mailing list from its inactive founders.
  - Make it more active, for instance by:
    - making its existence known to the Wikibase community,
    - forwarding some discussions that currently take place only on the Wikidata mailing list.
  - Share its management with Wikimedia Deutschland if they are interested.
- Restart the Wikibase Registry.
  - At the moment, it is managed by a Wikimedia Deutschland employee in their free time. As a consequence, it takes time to solve issues, like this one (I can’t access my account because I lost my password and emails are broken).
  - A technical upgrade would be welcome so the Wikibase Registry could be used to showcase Wikibase features. This could maybe be done with WBStack?
  - Promote the service so that more people register their Wikibase projects. The proactive approach used by Paul-Olivier Dehaye at WikidataCon 2019 was excellent and should be reproduced.
- As a Wikimedia affiliate, strengthen the link between the Wikimedia movement and the Wikibase Community User Group, especially because the Wikibase community is essentially composed of institutions outside the Wikimedia movement.
  - Report the activity of the user group to the Wikimedia movement.
  - Actively take part in the Wikimedia movement, like participating in the Wikimedia Summit and in the Strategic Wikimedia Affiliates Network.
- Rewrite the main page of the user group on Meta, to be more welcoming to newcomers.
Some of these actions can be done without being a group contact and I hope to work on them, regardless of the outcome of the election.
Candidacy
This could have been my candidacy. However, I don’t think I’m a suitable candidate for this position and I will not run for it. I hope the chosen representatives will adopt some of my ideas.
Wikibase logo by H. Snater, CC BY-SA 3.0.