Using tfsl to clean grammatical features on Wikidata lexemes

tfsl

tfsl is a Python framework written by Mahir256 to interact with lexicographical data on Wikidata. It has several cool features:

At the moment, it has some limitations:

Grammatical features

A lexeme is linked to a set of forms, which can usually be distinguished by their grammatical features. For instance, the lexeme L7 has two forms that differ by number: cat (singular) and cats (plural). A form can also combine several traits. For instance, the lexeme L11987 has four forms, distinguished by both gender and number, such as banal, which is masculine and singular, or banales, which is feminine and plural.

It is best practice to use one atomic value per trait, for example masculine and singular rather than a single masculine singular item. This makes querying easier and leaves fewer items to maintain.

Let’s clean grammatical features on French lexemes!

Finding lexemes with issues

First, we list some Wikidata items that mix several grammatical features, in a dictionary named replacements. Each key is a mixed item, like masculine singular (Q47088290), and its value is the set of items that should replace it, like masculine (Q499327) and singular (Q110786).
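As an illustration, the dictionary could look like the minimal sketch below. It assumes that features are handled as QID strings, and it only contains the one mapping spelled out in the text; the other mixed items would be added following the same pattern.

```python
# Mapping from a mixed grammatical feature to the atomic features replacing it.
# Assumption: features are represented as QID strings.
replacements = {
    # masculine singular -> masculine + singular
    'Q47088290': {'Q499327', 'Q110786'},
    # Other mixed items follow the same pattern.
}
```

Using sets as values makes it straightforward to merge the atomic features into the features of a form later on.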

Then, we build a SPARQL query to find French lexemes using these mixed items as grammatical features in their forms:

SELECT DISTINCT ?lexeme {
  ?lexeme dct:language wd:Q150 ; ontolex:lexicalForm/wikibase:grammaticalFeature ?feature .
  VALUES ?feature { wd:Q47088290 wd:Q47088292 wd:Q47088293 wd:Q47088295 }
}
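Rather than writing the VALUES clause by hand, the query can be generated from the list of mixed items. The sketch below is one possible way to do it; build_query is a hypothetical helper, not part of tfsl, and it only builds the query string without sending it to the Wikidata Query Service.

```python
def build_query(mixed_items, language='Q150'):
    """Return a SPARQL query listing the lexemes of a language
    that use one of the given items as a grammatical feature."""
    values = ' '.join(f'wd:{qid}' for qid in mixed_items)
    return (
        'SELECT DISTINCT ?lexeme {\n'
        f'  ?lexeme dct:language wd:{language} ; '
        'ontolex:lexicalForm/wikibase:grammaticalFeature ?feature .\n'
        f'  VALUES ?feature {{ {values} }}\n'
        '}'
    )


# The four mixed items used in the query above.
query = build_query(['Q47088290', 'Q47088292', 'Q47088293', 'Q47088295'])
print(query)
```

Passing another language item instead of Q150 (French) would let the same helper be reused for other languages.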

Cleaning lexemes

We retrieve each lexeme with tfsl, for instance:

import tfsl
lexeme = tfsl.L('L11987')

Then, we clean the forms of the lexeme:

# Replace each mixed feature of a form with its atomic equivalents.
for form in lexeme.forms:
    for mixed, atomic in replacements.items():
        if mixed in form.features:
            form.features.remove(mixed)
            form.features.update(atomic)
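Before touching real lexemes, the replacement logic can be checked offline on plain Python sets. The sketch below is a stand-in for that purpose: clean_features is a hypothetical helper, and it assumes the features of a form behave like a set of QID strings.

```python
def clean_features(features, replacements):
    """Return a new set where each mixed feature is replaced by its atomic ones."""
    cleaned = set(features)  # work on a copy, leave the input untouched
    for mixed, atomic in replacements.items():
        if mixed in cleaned:
            cleaned.remove(mixed)
            cleaned.update(atomic)
    return cleaned


# Only the mapping given in the text: masculine singular -> masculine + singular.
replacements = {'Q47088290': {'Q499327', 'Q110786'}}

# A form tagged "masculine singular" ends up tagged "masculine" and "singular".
print(clean_features({'Q47088290'}, replacements))
# A form that already uses atomic features is left unchanged.
print(clean_features({'Q499327', 'Q110786'}, replacements))
```

Running the cleaning twice gives the same result as running it once, so a lexeme that has already been cleaned is not modified again.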

Finally, we update the lexeme on Wikidata:

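# Replace 'username' and 'password' with your own Wikidata credentials.
session = tfsl.WikibaseSession('username', 'password')
# The second argument is the edit summary.
session.push(lexeme, 'cleaning grammatical features')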
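To avoid hardcoding credentials in a script, they can be read from the environment. The sketch below is only one way to do it; load_credentials and the variable names WIKIDATA_USERNAME and WIKIDATA_PASSWORD are hypothetical and not something tfsl defines.

```python
import os


def load_credentials(env=os.environ):
    """Read the username and password from environment variables.

    The variable names are hypothetical; pick whatever suits your setup.
    """
    username = env.get('WIKIDATA_USERNAME')
    password = env.get('WIKIDATA_PASSWORD')
    if not username or not password:
        raise RuntimeError('WIKIDATA_USERNAME and WIKIDATA_PASSWORD must be set')
    return username, password
```

The returned pair could then be passed to tfsl.WikibaseSession in place of the literal 'username' and 'password' strings.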

You can see the related diff on Wikidata.

Summary

In a nutshell, tfsl let us clean grammatical features on French lexemes with just a few lines of code (the full script is available). While it still has some limitations, this framework could soon become the best way to interact programmatically with lexicographical data on Wikidata.