Denelezh 2.0, a transitional version

by Envel Le Hir — CC BY 4.0 — April 17, 2018 — Gender gap, Wikidata, Wikipedia

At the beginning of April, a new version of Denelezh, a tool to explore the gender gap in the content of Wikidata and Wikimedia projects, was released. This post explains what led to this new version, including the choice of a new methodology to generate the metrics, and what you can expect in future releases. Finally, a technical overview of the tool is provided.

What’s new

A 4th dimension

Since its inception, Denelezh has provided multidimensional analysis. You can explore the gender gap in Wikidata along several dimensions: the year of birth, the country of citizenship, and the occupation of a human. These dimensions can be combined, for example to get metrics on the gender gap for French politicians born in 1901.
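To make the idea of combining dimensions concrete, here is a minimal sketch in Python. It is not the tool's actual code: the records, the `DIMENSIONS` tuple and the `count_sets` function are made up for illustration, and each human is assumed to have a single value per dimension, which is a simplification.

```python
from collections import Counter
from itertools import combinations

# Hypothetical records: (gender, year of birth, country, occupation).
humans = [
    ("female", 1901, "France", "politician"),
    ("male", 1901, "France", "politician"),
    ("male", 1901, "France", "writer"),
    ("male", 1950, "Germany", "politician"),
]

DIMENSIONS = ("year", "country", "occupation")


def count_sets(records):
    """Count genders for every combination of dimension values."""
    counts = {}
    for gender, *values in records:
        # Enumerate every subset of dimensions, from none to all three.
        for size in range(len(DIMENSIONS) + 1):
            for idx in combinations(range(len(DIMENSIONS)), size):
                key = tuple((DIMENSIONS[i], values[i]) for i in idx)
                counts.setdefault(key, Counter())[gender] += 1
    return counts


sets = count_sets(humans)
french_politicians_1901 = (
    ("year", 1901),
    ("country", "France"),
    ("occupation", "politician"),
)
print(sets[french_politicians_1901])  # gender counts for this set
print(sets[()])                       # gender counts for all humans
```

Each key is one "set" in the sense used in this post: a combination of zero or more dimension values, with gender counts attached.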

The most visible improvement in this new version of Denelezh is the addition of a fourth dimension: the Wikimedia project. All projects that have at least one page about a human according to Wikidata are included: not only the English Wikipedia, but also the young Atikamekw Wikipedia, Wikimedia Commons, Wikispecies, the Polish Wikiquote, …

Data is still extracted from Wikidata using its weekly dump. Thus, you can go back in time to observe the evolution of metrics you are interested in. For example, the French Wikipedia had 16.0 % of its biographies about women in January 2017, 16.3 % in July 2017, 16.6 % in January 2018, and is now, in April 2018, at 16.8 %. It seems encouraging but, in the meantime, 33,107 biographies about men were added in the French Wikipedia and only 11,202 about women.
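The numbers above can be checked with a few lines of arithmetic. The sketch below, in Python, only uses the two counts given in this post; the variable names are illustrative.

```python
men_added = 33_107
women_added = 11_202

total_added = men_added + women_added
share_women = 100 * women_added / total_added

# About a quarter of the added biographies are about women...
print(f"{total_added} biographies added, {share_women:.1f} % about women")
# ...but the gap in absolute numbers still widened.
print(f"absolute gap widened by {men_added - women_added} biographies")
```

The share of women among the added biographies is higher than the overall share, which is why the percentage rises, while the difference in absolute numbers between biographies about men and about women keeps growing.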

A new methodology

Although it is less visible, the most important improvement in this version is the new methodology to generate the statistics. The idea is to generalize the statistics produced in the first version of the tool.

In the previous version, only around 50 % of humans in Wikidata were kept, mainly in the hope that keeping only humans with all studied dimensions would improve the quality of the metrics provided by the tool. The problem is that this hypothesis was never confirmed (nor contradicted). Now, all available data is used, and in particular:

Items with several best values for the property instance of are not discarded anymore (they represent humans as long as one of these values is human).
Humans born before 1600 are not discarded anymore.
All normal and preferred values of the property country of citizenship are used; humans are not limited to one country of citizenship anymore.

The tool no longer tries to provide statistics about biases by introducing new biases 🙂
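The rules listed above can be sketched as follows. This is a simplified illustration in Python, not the actual implementation (the extraction is done in Java with Wikidata Toolkit): the function names and the input shapes are assumptions, and only the item ID Q5 (human) comes from Wikidata.

```python
HUMAN = "Q5"  # Wikidata item for "human"


def is_human(instance_of_values):
    """Keep the item as soon as one of its best 'instance of' values is human."""
    return HUMAN in instance_of_values


def citizenships(statements):
    """Keep every country whose statement has a normal or preferred rank.

    `statements` is a hypothetical list of (country, rank) pairs.
    """
    return sorted({country for country, rank in statements
                   if rank in ("normal", "preferred")})


# An item with several 'instance of' values is still counted as a human.
print(is_human(["Q5", "Q15632617"]))
# A human can have several countries of citizenship; deprecated ones are dropped.
print(citizenships([("France", "preferred"),
                    ("Germany", "normal"),
                    ("Italy", "deprecated")]))
```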

Other improvements include:

Parent occupations are deduced using the property subclass of. In this way, the tool directly provides metrics about general occupations, like politician or scientist.
All sitelinks are used, and not only a (buggy) subset of them.
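Deducing parent occupations amounts to following subclass of links upwards until no new occupation is found. Here is a minimal sketch in Python, under the assumption that the hierarchy is available as a dictionary; the `ancestors` function and the sample hierarchy are hypothetical and do not reflect the real Wikidata data.

```python
def ancestors(occupation, subclass_of):
    """Return every occupation reachable by following 'subclass of' links."""
    seen = set()
    stack = [occupation]
    while stack:
        current = stack.pop()
        for parent in subclass_of.get(current, ()):
            if parent not in seen:  # also protects against cycles
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical hierarchy: child occupation -> direct parent occupations.
subclass_of = {
    "nuclear physicist": ["physicist"],
    "physicist": ["scientist"],
    "scientist": ["researcher"],
}

print(sorted(ancestors("nuclear physicist", subclass_of)))
```

With such a closure, a human whose occupation is nuclear physicist is also counted in the sets for physicist and scientist.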

Future

Main features

Although they still need to be clearly defined, the main new features will be:

the addition of a fifth dimension (external IDs, although it still needs to be checked that this is possible),
the introduction of charts to track changes over time,
the ability to compare sets.

Denelezh will also be transformed into a more general tool, as explained in the next section.

Data quality

I have already worked on data quality in Wikidata, for example by cleaning BnF IDs (in French) or by contributing to Wikidata about the members of the French parliament with Dicare (in French). In the latter case, a dedicated dashboard (in French) provides statistics on the data held by Wikidata about members of the French National Assembly, legislature by legislature, and insights on what needs to be improved.

The idea is to provide, with Denelezh, a general dashboard to help Wikimedians contribute data about humans, with not only data on the gender gap but also other metrics, like missing properties (number of people without a gender, without a date of birth, …).

Usability

Usability is an important topic that needs to be covered. For example, the form needs to be more understandable and to have dedicated documentation. A lot of little things can drastically improve the tool, such as providing links to Wikidata items, links to the Wikidata Query Service to have live results, exports in CSV format… Finally, Denelezh needs internationalization: it’s quite ironic to have an application about gaps only available in English!

Technical overview

The source code is available on denelezh-core and denelezh-import.

Architecture

The tool is still divided into three parts:

A Java project, using the Wikidata Toolkit (0.8.0), to extract data from Wikidata weekly JSON dumps. It generates several CSV files in a format intended to be easily loaded into a relational database.
A PHP script, whose purpose is to generate and execute SQL queries. These queries load the data from the CSV files into a MySQL database and process it (in summary, they aggregate the data for the multidimensional analysis).
A website, to display the data.

In order to have reproducible results, the Wikidata Query Service is not used anymore (it was only used for labels in the previous version).

Some metrics

Denelezh is installed on a dedicated server with an i5-3570S CPU, 16 GB of RAM, and a slow hard disk, running Debian 8 (Jessie) as the operating system, nginx as the web server, and MySQL 5.7 as the relational database. The processing of the most recent dump (2018-04-09) takes around 11 hours:

about 4 hours and 15 minutes for downloading the dump,
more than 2 hours for the processing of the dump by Wikidata Toolkit,
about 4 hours and 30 minutes for loading the data into MySQL and computing the aggregates.

From this dump, 29,338,817 sets with at least one human were generated. The corresponding MySQL data file is about 2.7 GB. Data from each dump is stored in a separate MySQL partition to improve performance and to ease maintenance.

Feedback

Feel free to send feedback by email: envel -at- lehir.net

The following list is a synthesis of possible evolutions of the tool, collected from the Wikimedia community, both online (including on Wikimedia projects like Wikidata, English Wikipedia, French Wikisource, etc. but also on social networks like Twitter or Telegram) and offline (for instance at Wikimania, WikidataCon, volunteer meetings, etc.).

Features

Type	Name	Description
Core feature	✓ Evolution	Each metric should be traceable over time. Example: evolution of the gender gap on English Wikipedia for the last two years, with a value every month.
Core feature	Comparison	The metrics from two sets should be comparable. Example: compare occupations from two Wikipedias.
Metrics	Base metrics	The following statistics should be available for each set: total number of humans; number of humans with exactly one gender [already exists]; number of females, males, and others [already exists]; number of humans with at least / exactly one distinct year of birth (at preferred rank); number of humans with at least / exactly one distinct year of death (at preferred rank); number of humans with at least / exactly one distinct place of birth (at preferred rank); number of humans with at least / exactly one distinct place of death (at preferred rank); number of humans with at least one country of citizenship (at preferred rank); number of humans with at least one occupation (at preferred rank); number of humans with at least one image (at preferred rank); number of humans with at least one given name (at preferred rank); number of humans with at least one family name (at preferred rank).
Metrics	External ID metrics	External ID should be another available dimension (in addition to year of birth, country of citizenship, occupation, and project). Note: very expensive, may need optimization (probably architecture change) or feature limitations (removal of another dimension, drill down limited to N levels, …).
Metrics	Fictional content	Exploration of instances of fictional human (Q15632617), in addition to human (Q5), should be possible.
UI/UX	Internationalization	The application should be available in more languages than just English.
UI/UX	Sub-occupations	Display sub-occupations deduced by the subclass of property to facilitate the study of a set and drill down.
UI/UX	Set of projects	Metrics should be available by set of projects (i.e. all Wikipedias, all Wikisources, …).
UI/UX	Time intervals	Improve time intervals display: show metrics year by year, decade by decade, century by century, etc. in accordance with the size of the time interval chosen.
UI/UX	✓ Links to Wikidata	Each concept should be linked to its Wikidata item.
UI/UX	✓ Links to projects	Each project should be linked when cited, with a friendly name (not its code).
UI/UX	Links to WDQS	Each set should be linked to a SPARQL query to retrieve live data.
API	API	Provide an API so the data could be used by third parties.
API	Export	Statistics should be available in CSV format.
Other	Cleaning	Provide a list of barely used countries and occupations, so Wikidata contributors can clean them (there are many errors in them).

Technical

Category	Name	Description
Bug	Graphics artifacts	Sometimes, there is a blank area between the orange (female) and green (male) areas, even when there is no human with other gender in the set.
Optimization	✓ Database schema optimization	In the database, BIGINT should be replaced by INT in many places (in Wikidata, the ids are far from the unsigned INT limit of 4,294,967,295).
Optimization	Labels import optimization	Only load into the database the labels that are used (at the end of the Wikidata Toolkit job, generate a second label.csv file with only useful labels before loading them into the database).
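
The labels import optimization could look like the following sketch. It is written in Python for brevity, whereas the real job runs in Java with Wikidata Toolkit; the `filter_labels` function, the two-column layout (id, label) and the sample rows are assumptions, as the actual format of label.csv is not described here.

```python
import csv
import io


def filter_labels(source, target, used_ids):
    """Copy to `target` only the rows whose first column is in `used_ids`."""
    writer = csv.writer(target, lineterminator="\n")
    kept = 0
    for row in csv.reader(source):
        if row and row[0] in used_ids:
            writer.writerow(row)
            kept += 1
    return kept


# In-memory stand-ins for label.csv and the filtered file (hypothetical rows).
source = io.StringIO("Q5,human\nQ_A,unused label\nQ_B,scientist\n")
target = io.StringIO()

kept = filter_labels(source, target, used_ids={"Q5", "Q_B"})
print(kept)
print(target.getvalue())
```

Reading the file row by row keeps memory usage low; only the set of used ids has to fit in memory.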