A short history of Denelezh

A few weeks ago, Gnom was wondering what the actual size of the gender gap is:

Do we have an estimate of the size of the gender gap on Wikipedia given the current notability criteria? In other words, if every notable living or dead person had an article, what percentage would be about women? For example, would 30% women biographies be a good ballpark for a ‘closed’ Wikipedia gender gap?

This was my answer, giving some insights about the history of Denelezh, a tool that provides statistics about the gender gap in the content of Wikimedia projects:

Your question is the one I had in mind when I built the first version of Denelezh. My idea was to split the problem into smaller ones and to work on subsets to make the study easier. There are subsets where the percentage of women is perfectly known. For instance, here are the percentages of women in the lower house of the French Parliament since 1945. On the French-language Wikipedia, members of parliament automatically meet the notability criteria. Other subsets have to be studied. The tool partially allows this by providing statistics about humans depicted in Wikidata along various dimensions that you can combine (for instance, country of citizenship + occupation = French politicians). But merging these subsets is not easy, as they overlap: you can’t simply add the percentages from two sets when some people belong to both. An athlete can also be a politician, someone can have several countries of citizenship, and so on…
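To make the overlap problem concrete, here is a minimal sketch in Python. The people, genders, and subset names are invented for illustration; they are not real Wikidata figures, and this is not how Denelezh itself computes its statistics.

```python
# Hypothetical toy data: (person id, gender) pairs, not real Wikidata figures.
athletes = {("p1", "female"), ("p2", "male"), ("p3", "male"), ("p4", "female")}
politicians = {("p3", "male"), ("p4", "female"), ("p5", "male"), ("p6", "male")}


def share_of_women(people):
    """Return the percentage of entries whose gender is 'female'."""
    women = sum(1 for _, gender in people if gender == "female")
    return 100 * women / len(people)


# Naive merge: concatenating the two subsets counts p3 and p4 twice.
naive = share_of_women(list(athletes) + list(politicians))

# Merge on identity: a set union keeps each person exactly once.
merged = share_of_women(athletes | politicians)

print(f"athletes: {share_of_women(athletes):.1f}%")
print(f"politicians: {share_of_women(politicians):.1f}%")
print(f"naive merge: {naive:.1f}%")
print(f"deduplicated merge: {merged:.1f}%")
```

The two merged figures differ because the people who belong to both subsets are weighted twice in the naive version. Getting the right figure requires the underlying list of people, not just the percentage of each subset.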

The first version of the tool provided statistics about humans in Wikidata with a gender, a year of birth, and a country of citizenship. The (unverified) assumption was that Wikidata items with all these properties were of better quality than the ones missing one or more of them. The problem is that the statistics covered only about 50% of the humans depicted in Wikidata, and thus were misleading for people studying the gender gap in Wikimedia projects.

The second version, whose development was rushed, solved this problem by no longer filtering Wikidata items on the properties they have and by providing statistics whenever the data was available. With this change (and the addition of Wikimedia projects as a dimension to filter on or combine), it became closer to WHGI. One current problem is that the tool lacks statistics on Wikidata quality (for instance, how many Wikidata items depicting humans have a gender property?).
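As a rough illustration of what such a quality statistic could look like, the sketch below computes how many items carry a given property, and which items would have passed the first version's filter. The items are made up, and the `coverage` and `complete` helpers are hypothetical names, not part of Denelezh; only the property IDs (P21 for sex or gender, P569 for date of birth, P27 for country of citizenship) are the real Wikidata ones.

```python
# Hypothetical items: each maps Wikidata property IDs to a value.
# P21 = sex or gender, P569 = date of birth, P27 = country of citizenship.
items = [
    {"P21": "female", "P569": 1950, "P27": "France"},
    {"P21": "male", "P569": 1962},
    {"P21": "male"},
    {"P569": 1980, "P27": "Germany"},
]


def coverage(items, prop):
    """Return the percentage of items that have the given property."""
    return 100 * sum(1 for item in items if prop in item) / len(items)


def complete(items, props):
    """Return the items that have every property in props (first-version style filter)."""
    return [item for item in items if all(p in item for p in props)]


for prop in ("P21", "P569", "P27"):
    print(f"{prop}: {coverage(items, prop):.0f}% of items")

kept = complete(items, ("P21", "P569", "P27"))
print(f"items with all three properties: {len(kept)} of {len(items)}")
```

Publishing the coverage next to each gender statistic would let readers judge how much of the population the figure actually describes.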

The third version will be a merge of Denelezh and WHGI. Some ideas are already in the pipeline (adding external IDs as a dimension, producing lists of notable people to help Wikimedia editors find subjects to work on, etc.); others are on Phabricator. Feedback welcome 🙂