Using Wikidata to check votes of the Wikimedia Foundation board election

Seats on the Wikimedia Foundation Board of Trustees are regularly and partially renewed by the community. In 2022, two seats were at stake, in a complex procedure including a community vote that took place from August 23 to September 6. The list of voters is public, with many details: the date of the vote, the username of the voter, and whether the vote was cancelled (because the same account voted a second time, or because the vote was manually struck).

I wrote a set of scripts to gather data and compute statistics about this election. The first goal was to automatically check some rules of the voter eligibility guidelines, such as verifying that no bot account voted (there was at least one, and its vote was struck after it was reported to the elections committee). Let’s see how we can use Wikidata to perform some basic checks.

On Wikidata, the property Wikimedia username (P4174) links an item to a Wikimedia account. After retrieving the list of voters, it’s easy to check some rules with the related data on Wikidata.
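As an illustration, here is a minimal sketch of how this lookup can be done with the Wikidata Query Service (this is not the original scripts; the helper names and the User-Agent string are placeholders of mine):

import requests

WDQS = 'https://query.wikidata.org/sparql'

def run_query(query):
    # Send a SPARQL query to the Wikidata Query Service and return the bindings.
    response = requests.get(WDQS, params={'query': query, 'format': 'json'},
                            headers={'User-Agent': 'election-checks-example'})
    response.raise_for_status()
    return response.json()['results']['bindings']

def items_for_usernames(usernames):
    # Map each voting username to the Wikidata item carrying it as a
    # Wikimedia username (P4174) value.
    values = ' '.join('"%s"' % u.replace('"', '\\"') for u in usernames)
    query = '''SELECT ?item ?username WHERE {
      VALUES ?username { %s }
      ?item wdt:P4174 ?username .
    }''' % values
    return {row['username']['value']: row['item']['value'].rsplit('/', 1)[-1]
            for row in run_query(query)}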

A first idea was to check whether any vote was linked to someone supposed to be dead (there are about one hundred Wikimedians with a date of death on Wikidata). None was found.
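A hedged sketch of this check, reusing the run_query helper from the sketch above: it looks for voters whose item also carries a date of death (P570).

def deceased_voters(usernames):
    # Voters whose Wikidata item has both the username (P4174) and a date of death (P570).
    values = ' '.join('"%s"' % u.replace('"', '\\"') for u in usernames)
    query = '''SELECT ?username ?dateOfDeath WHERE {
      VALUES ?username { %s }
      ?item wdt:P4174 ?username ;
            wdt:P570 ?dateOfDeath .
    }''' % values
    return [(row['username']['value'], row['dateOfDeath']['value'])
            for row in run_query(query)]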

Another idea was to check whether someone used several accounts to vote. People can have several Wikimedia accounts for legitimate reasons: a separate bot account, a separate account for each employer when their work is related to Wikimedia, privacy concerns, and so on (example). With the list of voters, I checked whether several voting usernames belonged to the same item on Wikidata (and therefore to the same person).
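The sketch below, again relying on the hypothetical items_for_usernames helper above, groups the voters by Wikidata item and keeps the items attached to more than one voting username:

from collections import defaultdict

def duplicate_voters(usernames):
    # Group voting usernames by the Wikidata item they are attached to;
    # an item with more than one voting username is worth a manual review.
    by_item = defaultdict(list)
    for username, item in items_for_usernames(usernames).items():
        by_item[item].append(username)
    return {item: names for item, names in by_item.items() if len(names) > 1}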

With this method, two voting accounts were found to belong to the same person (Q78170633 is the related item on Wikidata): Masssly (their personal account, which voted on 5 September 2022) and Mohammed Sadat (WMDE) (their account as an employee of Wikimedia Deutschland, which voted on 26 August 2022). On the one hand, this is surprising because they are a member of the elections committee, which is supposed to enforce the voter eligibility guidelines; on the other hand, it’s not the first time they have caused concerns during an election (fortunately smoothly resolved by the community in that case). The issue was publicly reported on September 7. To my knowledge, it was not followed by any action or comment; in the meantime, several votes were struck in similar cases.

While these methods allowed some fraud to be found, they are very limited: many Wikimedians don’t have a Wikidata item, and many more don’t have all their usernames listed on their Wikidata item.

Don’t hesitate to contact me to discuss or to suggest improvements to the project.

Using tfsl to clean grammatical features on Wikidata lexemes

tfsl

tfsl is a Python framework written by Mahir256 to interact with lexicographical data on Wikidata. It has several cool features:

At the moment, it has some limitations:

Grammatical features

A lexeme is linked to a set of forms. These forms can usually be discriminated by their grammatical features. For instance, the lexeme L7 has two forms that can be distinguished by their number: cat (singular) and cats (plural). These grammatical features can be a combination of several traits. For instance, the lexeme L11987 has four forms, distinguished by their gender and their number, like banal, which is masculine and singular, or banales, which is feminine and plural.

It is a best practice to use atomic values for each trait of these grammatical features (easier querying, fewer items to maintain, etc.).

Let’s clean grammatical features on French lexemes!

Finding lexemes with issues

First, we list the items that mix several grammatical features on Wikidata. Each key represents a mixed item, like masculine singular (Q47088290), and the values are the items that should replace it, like masculine (Q499327) and singular (Q110786).
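As a minimal sketch, this mapping can be a plain Python dictionary. Only the first pair below is taken from the example above; the other mixed items appearing in the SPARQL query that follows would be filled in the same way. It also assumes that tfsl exposes the grammatical features of a form as a set of plain Q-identifiers.

replacements = {
    # masculine singular -> masculine + singular
    'Q47088290': {'Q499327', 'Q110786'},
    # the other mixed items (Q47088292, Q47088293, Q47088295) are mapped
    # to their atomic components in the same way
}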

Then, we build a SPARQL query to find French lexemes using these mixed items as grammatical features in their forms:

SELECT DISTINCT ?lexeme {
  ?lexeme dct:language wd:Q150 ;
          ontolex:lexicalForm/wikibase:grammaticalFeature ?feature .
  VALUES ?feature { wd:Q47088290 wd:Q47088292 wd:Q47088293 wd:Q47088295 }
}
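Here is a hedged sketch of how the query can be run from Python to collect the L-identifiers that will then be handed to tfsl (the User-Agent string is a placeholder):

import requests

QUERY = '''SELECT DISTINCT ?lexeme {
  ?lexeme dct:language wd:Q150 ;
          ontolex:lexicalForm/wikibase:grammaticalFeature ?feature .
  VALUES ?feature { wd:Q47088290 wd:Q47088292 wd:Q47088293 wd:Q47088295 }
}'''

response = requests.get('https://query.wikidata.org/sparql',
                        params={'query': QUERY, 'format': 'json'},
                        headers={'User-Agent': 'lexeme-cleanup-example'})
# Keep only the plain identifiers, e.g. 'L11987', from the full entity IRIs.
lexeme_ids = [row['lexeme']['value'].rsplit('/', 1)[-1]
              for row in response.json()['results']['bindings']]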

Cleaning lexemes

We retrieve each lexeme with tfsl, for instance:

lexeme = tfsl.L('L11987')

Then, we clean the forms of the lexeme:

for form in lexeme.forms:
    for feature in replacements:
        if feature in form.features:
            form.features.remove(feature)
            form.features.update(replacements[feature])

Finally, we update the lexeme on Wikidata:

session = tfsl.WikibaseSession('username', 'password')
session.push(lexeme, 'cleaning grammatical features')

You can see the related diff on Wikidata.

Summary

In a nutshell, we were able to easily clean grammatical features on French lexemes with a few lines of code (the full script is available) using tfsl. While it still has some limitations at the moment, this framework could become, in the near future, the best way to interact programmatically with lexicographical data on Wikidata.

How we imported the Etymological lexicon of modern Breton from Wikisource into Wikidata lexicographical data

The Lexique étymologique du breton moderne (Etymological lexicon of modern Breton) is a dictionary of the Breton language, written by Victor Henry and published in 1900. Starting in summer 2021, Nicolas Vigneron and I imported the content of the dictionary from Wikisource into Wikidata lexicographical structured data. Before the import, there were 283 Breton lexemes (≈ words) in Wikidata; after the import, there were more than 4,000.

This post is divided into three sections: Project, Breton language, and Technical side.

Project

Victor Henry

Victor Henry was a French philologist and linguist. He wrote several books, including the Lexique étymologique du breton moderne (Etymological lexicon of modern Breton), published in 1900, which is a dictionary, in French, about the Breton language. As Victor Henry died in 1907, more than 70 years ago, his work is in the public domain. The book is available on Wikisource, as plain text formatted with wikicode. Nicolas Vigneron and I imported the content of the dictionary into Wikidata lexicographical data, in a structured format.

Parsing

The first step was to parse the content of the dictionary to transform it into a machine-readable format. It was an iterative process, with parsing rules defined, implemented, and tested step by step. The results were regularly checked in order to:

To help check the data quality, several human-readable reports were created:

Import

Several lexemes were already present in Wikidata. They were edited before the import to avoid creating duplicates. In our case, we ensured that each of these lexemes had a statement described by source (P1343) with the value Lexique étymologique du breton moderne (Q19216625) and a qualifier stated as (P1932) with the headword of the corresponding entry in the dictionary (example).
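For illustration, here is a sketch (not the original script) of a query retrieving the lexemes that already carry this statement, together with the headword recorded in the qualifier, so that the importer can skip them. It assumes Q12107 is the item for the Breton language and uses a placeholder User-Agent string.

import requests

QUERY = '''SELECT ?lexeme ?headword WHERE {
  ?lexeme dct:language wd:Q12107 ;   # Breton
          p:P1343 ?statement .
  ?statement ps:P1343 wd:Q19216625 ; # Lexique étymologique du breton moderne
             pq:P1932 ?headword .    # headword as stated in the dictionary
}'''

response = requests.get('https://query.wikidata.org/sparql',
                        params={'query': QUERY, 'format': 'json'},
                        headers={'User-Agent': 'breton-import-example'})
already_imported = {row['headword']['value']: row['lexeme']['value'].rsplit('/', 1)[-1]
                    for row in response.json()['results']['bindings']}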

When importing data with a bot into Wikidata, it is mandatory to request permission. It is a useful process to discuss the import with the community and to gather relevant feedback. In our case, the discussion took place on Wikidata and on the Telegram group dedicated to Wikidata lexicographical data. This step should be started in parallel with the parsing, so that the feedback is taken into account early in the development phase.

Once the quality of the parsed data was considered good enough, the first edits were made on Wikidata. This allowed us to fix a few bugs and to complete the permission process.

The main import took place at the beginning of January 2022, with more than 3,700 lexemes created.

Fixes and enhancements

Several things were judged too difficult to import properly, so the imported data is not perfect. There are no obvious errors, but there is still work to do, such as adding senses and etymology. This work is of course collaborative. For instance, you can work on lexemes from the import that still don’t have any sense.

A first online workshop was organized in January 2022. It gathered 7 editors who were able to discover lexicographical data on Wikidata and to improve about 150 lexemes in Breton.

Tools

Several tools were used to facilitate the project, including:

Breton language

Examples of imported lexemes:

Implemented rules

Here is what was automatically imported from the dictionary:

Not implemented

Here is a list of things that were not automatically imported:

Technical side

The import was made using PHP and Python scripts, divided into three parts: crawling, parsing, and import.

Crawling

Crawling Wikisource was done with a PHP script. You can use any tool or scripting language to do the same.
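The original crawler being in PHP, here is an equivalent sketch in Python, fetching the wikicode of a Wikisource page through the MediaWiki API; the page title in the usage example is a hypothetical placeholder, not the actual structure of the book on Wikisource.

import requests

API = 'https://fr.wikisource.org/w/api.php'

def fetch_wikicode(title):
    # Retrieve the raw wikicode of a Wikisource page through the MediaWiki API.
    params = {
        'action': 'parse',
        'page': title,
        'prop': 'wikitext',
        'format': 'json',
        'formatversion': 2,
    }
    response = requests.get(API, params=params,
                            headers={'User-Agent': 'breton-import-example'})
    response.raise_for_status()
    return response.json()['parse']['wikitext']

# Hypothetical usage; the real page titles differ:
# wikicode = fetch_wikicode('Lexique étymologique du breton moderne')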

Parsing

Parsing was done with a Python script. At first, it was written in PHP, but it was rewritten once it was decided to use Python for the import step.
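To give an idea of the approach, here is a purely illustrative sketch of rule-based parsing; it assumes each entry starts with a bolded headword in the wikicode, which does not necessarily match the real layout of the dictionary, and the actual rules were far more elaborate and refined iteratively.

import re

# Hypothetical rule: an entry is a line starting with a bolded headword.
ENTRY_PATTERN = re.compile(r"^'''(?P<headword>[^']+)'''\s*(?P<body>.*)$")

def parse_entries(wikicode):
    entries = []
    for line in wikicode.splitlines():
        match = ENTRY_PATTERN.match(line.strip())
        if match:
            entries.append({'headword': match.group('headword'),
                            'body': match.group('body')})
    return entries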

Import

There are many ways to import data into Wikidata. In our case, we needed a tool able to create new lexemes. Here are some of the options that were considered:

I eventually decided to use Pywikibot for the main import. It had several advantages:

The drawback is that it does not support lexemes. It was necessary to create the JSON sent to the MediaWiki API “by hand”. In this regard, this post on Phabricator was really useful for understanding some subtleties, like how to create new forms with the add keyword.
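Here is a hedged sketch of what such a creation can look like: the JSON for a new lexeme is built by hand and sent to the wbeditentity module through Pywikibot's low-level API request, with a new form declared through the add keyword. The lemma and the Q-identifiers (Q12107 for Breton, Q1084 for noun) are illustrative values, not taken from the real import script.

import json
import pywikibot
from pywikibot.data import api

site = pywikibot.Site('wikidata', 'wikidata')
site.login()

# JSON of the new lexeme, built by hand (illustrative values).
data = {
    'language': 'Q12107',        # Breton
    'lexicalCategory': 'Q1084',  # noun
    'lemmas': {'br': {'language': 'br', 'value': 'kazh'}},
    'forms': [{
        'add': '',               # the keyword that creates a new form
        'representations': {'br': {'language': 'br', 'value': 'kazh'}},
        'grammaticalFeatures': [],
        'claims': {},
    }],
}

request = api.Request(site=site, parameters={
    'action': 'wbeditentity',
    'new': 'lexeme',
    'data': json.dumps(data),
    'token': site.tokens['csrf'],
    'summary': 'importing a lexeme (illustrative example)',
})
result = request.submit()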

Some refinements were made using QuickStatements.

Source code

The source code is released under the CC0 license (public domain dedication). It has already been reused by another project!


Photo by Antoine Meyer, public domain.