Dedicated Dashboard, another tool to improve the data quality in Wikidata

At Wikimania 2018, Lydia Pintscher, Product Manager for Wikidata at Wikimedia Deutschland (WMDE), presented a cool poster about data quality tools for Wikidata. Several tools are cited, including:

In my opinion, this list lacks tools dedicated to a specific topic, like the dashboard I made about French members of parliament (MPs), whose first version was released in 2017.

A dedicated dashboard

The goal of this tool is to provide a quick and convenient way to monitor data quality about the members of the French National Assembly in the Fifth Republic. It is divided into two components:

Each row of the dashboard represents a business rule, the first ones providing general statistics (for example number of MPs) and the following ones data quality issues (for example MPs without France as country of citizenship). To facilitate the work on this set, the data is broken down by time and each column of the dashboard represents a legislative term.
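The row-and-column layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample records, the `Rule` class and the `build_grid` function are hypothetical names invented for this example, not the dashboard's actual code.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical, hand-written sample records (the real tool reads from Wikidata).
MPS = [
    {"name": "MP A", "term": 14, "citizenship": ["France"]},
    {"name": "MP B", "term": 14, "citizenship": []},
    {"name": "MP C", "term": 15, "citizenship": ["France"]},
    {"name": "MP D", "term": 15, "citizenship": []},
    {"name": "MP E", "term": 15, "citizenship": []},
]


@dataclass(frozen=True)
class Rule:
    """A business rule: a label plus a predicate applied to one MP record."""
    label: str
    matches: Callable[[dict], bool]


RULES = [
    # General statistic: every MP matches.
    Rule("Number of MPs", lambda mp: True),
    # Data quality issue: France is missing from the citizenship values.
    Rule("MPs without France as country of citizenship",
         lambda mp: "France" not in mp["citizenship"]),
]


def build_grid(mps, rules, terms):
    """Return {rule label: {term: count}}: one row per rule, one column per term."""
    return {
        rule.label: {
            term: sum(1 for mp in mps if mp["term"] == term and rule.matches(mp))
            for term in terms
        }
        for rule in rules
    }


grid = build_grid(MPS, RULES, terms=[14, 15])
for label, row in grid.items():
    print(label, row)
```

Each cell is the number of MPs who match both the rule of the row and the legislative term of the column.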

When you click on a cell, the MPs meeting the criteria (from both the row and the column) are listed. For each MP, the tool provides a link to the Wikidata item, as well as to the French Wikipedia and to useful third-party databases, like that of the French National Assembly. This allows Wikidata contributors to easily find data quality issues on this topic and to quickly fix them with reliable sources.

Statistics in the main dashboard are updated once a day; lists of MPs are live from the Wikidata Query Service (WDQS, which has a 5-minute cache).
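As an illustration of what a live list could look like, here is a sketch that builds a SPARQL query for one cell (MPs of a given legislative term without France as country of citizenship) and the corresponding WDQS URL. The property and item identifiers are my assumptions about how the data is modelled in Wikidata, and the function names are hypothetical; this is not necessarily the query the dashboard uses.

```python
from urllib.parse import urlencode

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Assumed Wikidata identifiers (not taken from the post).
POSITION_HELD = "P39"
PARLIAMENTARY_TERM = "P2937"
COUNTRY_OF_CITIZENSHIP = "P27"
FRENCH_MP = "Q3044918"  # assumed: member of the French National Assembly
FRANCE = "Q142"


def mps_without_french_citizenship(term_qid: str) -> str:
    """Build a query listing MPs of one term lacking France as citizenship."""
    return f"""
SELECT DISTINCT ?mp ?mpLabel WHERE {{
  ?mp p:{POSITION_HELD} ?mandate .
  ?mandate ps:{POSITION_HELD} wd:{FRENCH_MP} ;
           pq:{PARLIAMENTARY_TERM} wd:{term_qid} .
  FILTER NOT EXISTS {{ ?mp wdt:{COUNTRY_OF_CITIZENSHIP} wd:{FRANCE} . }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "fr,en" . }}
}}
ORDER BY ?mpLabel
""".strip()


def wdqs_url(query: str) -> str:
    """Return the URL that asks WDQS for JSON results (no request is sent here)."""
    return f"{WDQS_ENDPOINT}?{urlencode({'query': query, 'format': 'json'})}"


# "Q_TERM_PLACEHOLDER" stands for the item of a legislative term; replace it
# with a real QID before running the query.
query = mps_without_french_citizenship("Q_TERM_PLACEHOLDER")
print(wdqs_url(query))
```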

Generalizing

I generalized the features of this tool into a new one, Dedicated Dashboard, which allows you to set up your own dashboard on the topic of your choice, using SPARQL queries to populate it. Several examples are provided:

The only feature not yet carried over is the links to Wikimedia projects and third-party databases. The tool also needs a way to list existing dashboards, allowing users to easily manage them (at the moment, the configuration of a dashboard is stored in its URL).
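To show what storing a configuration in a URL can mean in practice, here is a minimal round-trip sketch. The base URL, the `config` parameter name and the shape of the configuration are assumptions made for the example, not the tool's actual format.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "https://example.org/dashboard"  # placeholder, not the real tool


def config_to_url(config: dict) -> str:
    """Serialize the dashboard configuration as JSON in a query parameter."""
    payload = json.dumps(config, separators=(",", ":"), sort_keys=True)
    return f"{BASE_URL}?{urlencode({'config': payload})}"


def url_to_config(url: str) -> dict:
    """Recover the configuration from a URL produced by config_to_url."""
    params = parse_qs(urlsplit(url).query)
    return json.loads(params["config"][0])


# Hypothetical configuration: a title and one SPARQL query per row.
config = {
    "title": "My dashboard",
    "rows": [
        {"label": "Number of items", "sparql": "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }"},
    ],
}
url = config_to_url(config)
print(url)
print(url_to_config(url) == config)
```

One consequence of this design is that nothing is stored on the server, which fits the observation above that dashboards cannot yet be listed. Long SPARQL queries also produce long URLs.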

Happy birthday Wikidata 😉

Denelezh 2.0, a transitional version

As of October 31st, 2018, this tool was discontinued and now redirects to WDCM Biases Dashboard made by Wikimedia Deutschland (WMDE). You can also use Wikidata Human Gender Indicators (WHGI).

At the beginning of April, a new version of Denelezh, a tool to explore the gender gap in the content of Wikidata and Wikimedia projects, was released. This post explains what led to this new version, including the choice of a new methodology to generate the metrics, and what you can expect in future releases. Finally, a technical overview of the tool is provided.

What’s new

A 4th dimension

Since its inception, Denelezh has provided multidimensional analysis. You can explore the gender gap in Wikidata by several dimensions: the year of birth, the country of citizenship, and the occupation of a human. It is possible to combine these dimensions, for example to have metrics on the gender gap for French politicians born in 1901.
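Combining dimensions can be pictured as filtering on several attributes at once and then counting genders. The sketch below uses invented sample records and hypothetical function names. It also assumes a single value per dimension for each human, whereas a Wikidata item can hold several.

```python
from collections import Counter

# Invented sample data, one value per dimension for simplicity.
HUMANS = [
    {"gender": "female", "year_of_birth": 1901, "citizenship": "France", "occupation": "politician"},
    {"gender": "male", "year_of_birth": 1901, "citizenship": "France", "occupation": "politician"},
    {"gender": "male", "year_of_birth": 1901, "citizenship": "France", "occupation": "politician"},
    {"gender": "male", "year_of_birth": 1901, "citizenship": "France", "occupation": "politician"},
    {"gender": "female", "year_of_birth": 1901, "citizenship": "France", "occupation": "painter"},
    {"gender": "male", "year_of_birth": 1950, "citizenship": "Italy", "occupation": "politician"},
]


def gender_counts(humans, **criteria):
    """Count genders among the humans matching every given dimension value."""
    selected = [h for h in humans if all(h.get(k) == v for k, v in criteria.items())]
    return Counter(h["gender"] for h in selected)


def female_share(counts):
    """Share of females in a set, or None when the set is empty."""
    total = sum(counts.values())
    return counts["female"] / total if total else None


counts = gender_counts(HUMANS, year_of_birth=1901, citizenship="France", occupation="politician")
print(counts, female_share(counts))
```

Leaving out a criterion widens the set: `gender_counts(HUMANS, citizenship="France")` covers all French humans in the sample, whatever their occupation or year of birth.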

The most visible improvement in this new version of Denelezh is the addition of a fourth dimension: the Wikimedia project. All projects that have at least one page about a human according to Wikidata are included: not only the English Wikipedia, but also the young Atikamekw Wikipedia, Wikimedia Commons, Wikispecies, the Polish Wikiquote, …

Data is still extracted from Wikidata using its weekly dump. Thus, you can go back in time to observe the evolution of metrics you are interested in. For example, the French Wikipedia had 16.0 % of its biographies about women in January 2017, 16.3 % in July 2017, 16.6 % in January 2018, and is now, in April 2018, at 16.8 %. It seems encouraging but, in the meantime, 33,107 biographies about men were added in the French Wikipedia and only 11,202 about women.
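The two counts quoted above can be turned into a share of women among the biographies added over the period. This is plain arithmetic on the figures given, considering only these two counts (other genders are not part of the quoted numbers).

```python
added_men = 33_107
added_women = 11_202

# Share of women among the biographies added over the period.
share_of_women = added_women / (added_men + added_women)
print(f"{share_of_women:.1%}")  # 25.3%
```

Roughly one new biography in four was about a woman. That is above the overall share, which is why the share rises, but still far from parity, which is why it rises slowly.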

A new methodology

Although it is less visible, the most important improvement in this version is the new methodology to generate the statistics. The idea is to generalize the statistics produced in the first version of the tool.

In the previous version, only around 50 % of humans in Wikidata were kept, mainly in the hope that keeping only humans with all studied dimensions would improve the quality of the metrics provided by the tool. The problem is that this hypothesis was never confirmed (nor contradicted). Now, all data available is used, and in particular:

The tool no longer tries to provide statistics about biases by introducing new biases 🙂

Other improvements include:

Future

Main features

Even if they need to be clearly defined, the main new features will be:

Denelezh will also be transformed into a more general tool, as explained in the next section.

Data quality

I have already worked on data quality in Wikidata, for example by cleaning BnF IDs (in French) or by contributing to Wikidata about the members of the French parliament with Dicare (in French). In the latter case, a dedicated dashboard (in French) provides statistics on the data held by Wikidata about members of the French National Assembly, legislature by legislature, and insights on what needs to be improved.

The idea is to provide, with Denelezh, a general dashboard to help Wikimedians contribute data about humans, with not only data on the gender gap but also other metrics, like missing properties (number of people without a gender, without a date of birth, …).

Usability

Usability is an important topic that needs to be covered. For example, the form needs to be more understandable and to have dedicated documentation. A lot of little things can drastically improve the tool, such as providing links to Wikidata items, links to the Wikidata Query Service to have live results, exports in CSV format… Finally, Denelezh needs internationalization: it’s quite ironic to have an application about gaps only available in English!

Technical overview

Architecture

The tool is still divided into three parts:

In order to have reproducible results, the Wikidata Query Service is not used anymore (it was only used for labels in the previous version).

Some metrics

Denelezh is installed on a dedicated server with an i5-3570S CPU, 16 GB of RAM, and a slow hard disk, running Debian 8 (Jessie) as the operating system, nginx as the web server, and MySQL 5.7 as the relational database. The processing of the most recent dump (2018-04-09) takes around 11 hours:

From this dump, 29,338,817 sets with at least one human were generated. The corresponding MySQL data file is about 2.7 GB. Data from each dump is stored in a separate MySQL partition to improve performance and to ease maintenance.

Feedback

Feel free to send feedback, by email (envel -at- lehir.net) or on my Wikidata talk page.

The following list is a synthesis of possible evolutions of the tool, collected from the Wikimedia community, both online (including on Wikimedia projects like Wikidata, English Wikipedia, French Wikisource, etc. but also on social networks like Twitter or Telegram) and offline (for instance at Wikimania, WikidataCon, volunteer meetings, etc.).

Features

Type | Name | Description
Core feature | Evolution | Each metric should be traceable over time. Example: evolution of the gender gap on English Wikipedia for the last two years, with a value every month.
Core feature | Comparison | The metrics from two sets should be comparable. Example: compare occupations from two Wikipedias.
Metrics | Base metrics | The following statistics should be available for each set:

  • total number of humans
  • number of humans with exactly one gender [already exists]
  • number of females, males, and others [already exists]
  • number of humans with at least / exactly one distinct year of birth (at preferred rank)
  • number of humans with at least / exactly one distinct year of death (at preferred rank)
  • number of humans with at least / exactly one distinct place of birth (at preferred rank)
  • number of humans with at least / exactly one distinct place of death (at preferred rank)
  • number of humans with at least one country of citizenship (at preferred rank)
  • number of humans with at least one occupation (at preferred rank)
  • number of humans with at least one image (at preferred rank)
  • number of humans with at least one given name (at preferred rank)
  • number of humans with at least one family name (at preferred rank)
Metrics | External ID metrics | External ID should be another available dimension (in addition to year of birth, country of citizenship, occupation, and project). Note: very expensive, may need optimization (probably an architecture change) or feature limitations (removal of another dimension, drill down limited to N levels, …).
Metrics | Fictional content | Exploration of instances of fictional human (Q15632617), in addition to human (Q5), should be possible.
UI/UX | Internationalization | The application should be available in more languages than just English.
UI/UX | Sub-occupations | Display sub-occupations deduced from the “subclass of” property to facilitate the study of a set and drill down.
UI/UX | Set of projects | Metrics should be available by set of projects (i.e. all Wikipedias, all Wikisources, …).
UI/UX | Time intervals | Improve time intervals display: show metrics year by year, decade by decade, century by century, etc. in accordance with the size of the time interval chosen.
UI/UX | Links to Wikidata | Each concept should be linked to its Wikidata item.
UI/UX | Links to projects | Each project should be linked when cited, with a friendly name (not its code).
UI/UX | Links to WDQS | Each set should be linked to a SPARQL query to retrieve live data.
API | API | Provide an API so the data can be used by third parties.
API | Export | Statistics should be available in CSV format.
Other | Cleaning | Provide a list of barely used countries and occupations.
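The difference between “at least one” and “exactly one distinct” value in the base metrics can be made concrete with a small sketch. The sample records and the `base_metrics` function are hypothetical. The sketch also assumes the values were already filtered by rank, so rank handling is left out.

```python
# Invented sample: each human carries the years of birth found in its statements.
HUMANS = [
    {"id": "A", "years_of_birth": [1901]},
    {"id": "B", "years_of_birth": [1850, 1851]},  # two conflicting years
    {"id": "C", "years_of_birth": [1920, 1920]},  # same year stated twice
    {"id": "D", "years_of_birth": []},            # no year of birth at all
]


def base_metrics(humans, key):
    """Count humans with at least one / exactly one distinct value for a property."""
    distinct = [set(h[key]) for h in humans]
    return {
        "total": len(humans),
        "at_least_one": sum(1 for values in distinct if len(values) >= 1),
        "exactly_one": sum(1 for values in distinct if len(values) == 1),
    }


metrics = base_metrics(HUMANS, "years_of_birth")
print(metrics)
```

In this sample, human C counts as having exactly one distinct year even though the year is stated twice, while human B only counts towards “at least one”.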

Technical

Category | Name | Description
Bug | Graphics artifacts | Sometimes, there is a blank area between the orange (female) and green (male) areas, even when there is no human with another gender in the set.
Optimization | Database schema optimization | In the database, BIGINT should be replaced by INT in many places (in Wikidata, the IDs are far from the unsigned INT limit of 4,294,967,295).
Optimization | Labels import optimization | Only load into the database the labels that are used (at the end of the Wikidata Toolkit job, generate a second label.csv file with only the useful labels before loading them into the database).
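The label filtering idea can be sketched as follows. The two-column layout of label.csv (item ID, label) and the function name are assumptions made for this example. The post places this step at the end of the Wikidata Toolkit job, so this Python version only illustrates the filtering logic.

```python
import csv
import io


def write_useful_labels(source, target, used_ids):
    """Copy only the rows whose ID is referenced by the statistics; return the count."""
    reader = csv.reader(source)
    writer = csv.writer(target, lineterminator="\n")
    kept = 0
    for row in reader:
        if row and row[0] in used_ids:
            writer.writerow(row)
            kept += 1
    return kept


# In-memory files stand in for label.csv and the filtered output file.
all_labels = io.StringIO("Q5,human\nQ142,France\nQ82955,politician\nQ42,Douglas Adams\n")
useful_labels = io.StringIO()

# IDs actually used as countries or occupations in the generated sets (hypothetical).
kept = write_useful_labels(all_labels, useful_labels, used_ids={"Q142", "Q82955"})
print(kept)
print(useful_labels.getvalue())
```

Keeping the used IDs in a set makes each lookup constant time, and the source is read row by row, so memory use does not grow with the size of the full label file.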

Looking back at Dicare

Dicare was a website devoted to the members of parliament (MPs) of the French Fifth Republic. One of its goals was to showcase the Wikimedia projects by demonstrating that their content can be reused in other projects. Accordingly, Dicare’s structured data came from the Wikidata knowledge base and its images from the Wikimedia Commons media repository.

Dicare

The core of Dicare was the history of the terms of office of the MPs of the Fifth Republic (for each term: MP, legislature, electoral district, start and end dates). Before 2016, Wikidata’s information on this subject was patchy. From its launch in April 2016, the site allowed me to track my contributions on this topic. By the end, Dicare had data on more than 2,500 MPs of the Fifth Republic. All terms of office were recorded for the previous and current legislatures (the 14th and 15th), as well as for the whole of the Fifth Republic for many departments (including all the Breton departments). Beyond this history, several topics were explored.

The first was the longevity of MPs in the National Assembly. A few records stand out, starting with that of Didier Julia, who spent more than 44 years on the benches of the Assembly. Others stayed for much less time: a single day for Catherine Pen, who was unable to replace the MP whose substitute she was; a few days for many substitutes, replacing MPs appointed to the government just before legislative elections; etc.

A second topic was gender equality. Unsurprisingly, parity has never existed in the National Assembly, even though the situation has improved noticeably since the 11th legislature in 1997. Before that, women had always made up less than 10 % of the National Assembly!

A third topic was the use of Twitter by MPs. For all MPs of the 14th and 15th legislatures, I took the time to check whether they had a Twitter account (and, if so, to record it in Wikidata). Here is, as of February 2017, the number of MPs with a Twitter account, by age group:

Age group | Number of MPs | With a Twitter account | Share
39 and under | 17 | 17 | 100 %
40 to 49 | 83 | 76 | 92 %
50 to 59 | 181 | 158 | 87 %
60 to 69 | 189 | 140 | 74 %
70 and over | 102 | 58 | 57 %
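The shares in this table follow directly from the two count columns, as this small check shows. The overall figure at the end is derived from the table's totals and is not stated in the original post.

```python
# (age group, number of MPs, MPs with a Twitter account), copied from the table.
ROWS = [
    ("39 and under", 17, 17),
    ("40 to 49", 83, 76),
    ("50 to 59", 181, 158),
    ("60 to 69", 189, 140),
    ("70 and over", 102, 58),
]

# Share per age group, rounded to the nearest percent.
shares = {label: round(100 * with_twitter / mps) for label, mps, with_twitter in ROWS}
print(shares)

# Overall share across all age groups, derived from the totals.
total_mps = sum(mps for _, mps, _ in ROWS)
total_with_twitter = sum(with_twitter for _, _, with_twitter in ROWS)
print(total_mps, total_with_twitter, round(100 * total_with_twitter / total_mps))
```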

Following the legislative elections of June 2017, the figures evened out markedly (in every age group, at least 70 % of MPs had a Twitter account), probably because communication on social networks was part of the campaign strategy of many candidates.

Finally, a Twitter bot wished a happy birthday each day to the MPs with a current term of office, which led to a few amusing discussions, like this one.

Data quality

Data from Wikidata was imported into Dicare from time to time. Several methods allowed me to check the quality of the data in Wikidata before or after its import into Dicare:

Note that Wikidata’s checkConstraints gadget is also excellent, but it can only be used on one Wikidata item at a time (and is still subject to the limited expressiveness of constraints).

I briefly tested the JBoss Drools rule engine, but it turned out to be a poor fit for my case: development in a second programming language alongside that of the website, the need to replicate the data model, etc. The framework is better suited to more complex problems and to settings where the technical choices are homogeneous.

Finally, I have not yet had time to try ShEx.

End and next steps

The site closed in March 2018. However, nothing is lost. The data remains available in the Wikimedia projects: the structured data in Wikidata and the images in Wikimedia Commons. The Élus and Parliaments projects are still active on Wikidata. In addition, the site’s source code is available under a free license (AGPLv3) on GitHub. The associated tools (Dicare Tools) are still available and their source code is on GitHub.