The WikidataCon Card Game

My present to Wikidata for its 7th birthday is a card game, a way to thank Wikimedians, individually and globally.

Inception

In May 2019, a weekend about Wikimedia volunteering and local groups took place at Wikimédia France, organized by Rémy Gerbet and facilitated by Olivier Taieb. Apart from a short preparation, Olivier didn’t know the Wikimedia movement before starting, but he quickly settled in, his facilitation over the weekend was wonderful, and the results were more than useful. One of his early observations was that we, Wikimedians, tend to flagellate ourselves and have a hard time congratulating each other. In a discussion we never finished, we started to explore ways to thank other Wikimedians for their work. This topic is not new but, in my opinion, still in its infancy. For instance, the first analysis of the Thank feature on Wikimedia projects only popped up a month ago. The first edition of the Coolest Tool Award was held this year, even though the Wikimedia movement is more than 18 years old and has produced hundreds of tools.

At Wikimania, I remember having difficulty starting a conversation with someone just to compliment them on their work. So, in the following weeks, I looked for a way to do it more easily. For WikidataCon 2019, one of my ideas was to make cards with the expression “thank you”, each card in a different language (as languages were the theme of this edition). However, random thanks don’t make it much easier to start a conversation and are probably not very effective. So I decided to build a full deck of specific thanks, inspired, among other things, by the Wikidata Card Game Generator made by bleeptrack and blinry, and the WikidataCon Award organized by Birgit Müller.

Concept

A deck of 60 different cards was built. Each card represents something related to Wikimedia, Wikidata or Germany (where the conference took place), and is composed of an image and the ID of the concept in Wikidata. Each attendee of the conference received a card holder (this was planned before the idea of the card game came up) with a random card in it. As there were 250 attendees, several identical decks were printed, and several participants received the same card. Each card could be used as an icebreaker, as an attendee could:

And I cheated. I printed an additional deck and thanked some participants for their work, using specific cards as icebreakers and dedicated thanks. Yes, I missed some people (even among people I talked with) 🙁

These cards are also a way to thank everyone, Wikimedians or not, contributing to the projects cited on the cards.

Content

Here are some insights about the content of the game. Sadly, choices had to be made in a short time, and a lot of cool things are not in the deck. It also lacks Easter eggs and random items from everyday life or from Germany. Maybe for the next edition of WikidataCon, with more time to prepare a better deck?

Some photos: Léa ~ Auregann, Caroline Becker, Petit Tigre, Pierre-Selim Huard, P-Y

Making-of

First, a big thank you to Jean-Frédéric and Léa for their help, and to Wikimedia Deutschland for adopting the idea 🙂

I had the idea only a few weeks before the conference. The content was chosen and the design of the cards generated in less than ten days. It was fun but intense. One difficulty was that a lot of cool things don’t have a logo or an image under a free license to depict them. All the images but one come from Wikimedia Commons, and all are under free licenses (including many in the public domain or under CC0). The full list of credits is available on Wikidata.

Card designs were generated using a PHP script, after editing some of the images with GIMP (for example to remove most of the text from logos and make them harder to identify). The library used was ImageMagick, with excellent results. PHP GD, which does basically the same thing, was tested but produced images of very poor quality, with a lot of artifacts. The source code is available under the CC0 license (public domain).
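
For the curious, here is a minimal sketch of what a card generator based on the Imagick extension can look like. This is not the actual script: the template size, coordinates, font and file names are made up for the example.

<?php
// Minimal, hypothetical card generator using the Imagick extension.
// Sizes, coordinates, fonts and file names are placeholders.
$card = new Imagick();
$card->newImage(600, 900, new ImagickPixel('white'));
$card->setImageFormat('png');

// Place the (pre-edited) illustration of the concept on the card.
$illustration = new Imagick('images/Q2013.png');
$illustration->resizeImage(500, 500, Imagick::FILTER_LANCZOS, 1, true);
$card->compositeImage($illustration, Imagick::COMPOSITE_OVER, 50, 100);

// Write the Wikidata ID of the concept at the bottom of the card.
$draw = new ImagickDraw();
$draw->setFont('fonts/DejaVuSans-Bold.ttf');
$draw->setFontSize(48);
$draw->setTextAlignment(Imagick::ALIGN_CENTER);
$card->annotateImage($draw, 300, 830, 0, 'Q2013');

$card->writeImage('cards/Q2013.png');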

The choice of the printer was not ideal (a company based in Hong Kong, so the cards had to travel half the world for the conference…), but their price was fair, they had good recommendations and a user-friendly editor, and they could deliver within the short deadline.

To be continued…

A lot of variations were suggested during the event:

Feel free to recycle the ideas!

Exploring Wikidata properties by the similarity of their use

A few weeks ago, I released Related Properties, a tool to explore Wikidata properties and find which ones are used together.

Features overview

The first idea, when you want to find which properties are most used with one specific property in Wikidata, is to look at the cardinality of intersection, i.e. the number of items that use both properties. The issue with this method is that it mainly returns general properties. For instance, when you look at the closest properties of archives at (P485) sorted by the cardinality of intersection, you get a bunch of general properties about humans (sex or gender, occupation, given name, …).

Another idea is to use the Jaccard index, which is the cardinality of intersection divided by the cardinality of union of the two sets. It makes it possible to find properties that are mostly used together rather than on largely disjoint sets of items. With the same example of archives at (P485), we can see that the closest properties sorted by the Jaccard index are quite different, mostly external IDs from authorities.
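
As an illustration, here is how the index can be computed from the counts just mentioned. This is only a sketch with made-up numbers; the tool itself does this computation in SQL, as described in the technical overview below.

<?php
/**
 * Jaccard index of two properties, computed from the number of items using
 * each property and the number of items using both (hypothetical helper,
 * not taken from the tool's source code).
 */
function jaccardIndex(int $countA, int $countB, int $intersection): float
{
    // |A ∩ B| / |A ∪ B| = |A ∩ B| / (|A| + |B| - |A ∩ B|)
    return $intersection / ($countA + $countB - $intersection);
}

// Made-up numbers: two properties used on 10,000 and 12,000 items,
// 9,000 of which use both.
echo jaccardIndex(10000, 12000, 9000); // ≈ 0.69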

In a nutshell: sorting by the cardinality of intersection mostly surfaces very common properties that happen to appear alongside the studied property, while sorting by the Jaccard index surfaces the properties whose sets of items overlap the most with it.

The tool shows the closest properties according to both methods. Each property is displayed with its English label and its P number, and is linked to its page on Wikidata. Properties can be filtered by type, for example to gather statistics about external IDs only. The data can be downloaded from the main page of the tool.

Limits

At the moment, statistics are limited to:

Other methods to detect similarity should be made available in the future. For instance, the fact that P4285 is (or should be) a subset of P269 is not clearly visible at the moment.

Note: the idea to use the Jaccard index comes from Goran S. Milovanović (T214897).

Technical overview

The tool relies on the weekly Wikidata JSON dump, which is read in a single pass with the Wikidata Toolkit to compute the cardinality of each property and the cardinality of intersection of each pair of properties. The data is then imported into a MySQL database to compute the Jaccard index and to easily display the data with PHP.

Here is a description of the algorithm and its main variables used to generate the statistics:

set p_s to an empty set of pairs;
set q_s to an empty set of 4-tuples;
for each item in the Wikidata JSON dump:
    set u_s to an empty set of singletons;
    for each statement with normal or preferred rank in the item:
        set p to the main property used in the statement;
        if p not in u_s:
            add p to u_s;
    for each property pa in u_s:
        if (pa, _) not in p_s:
            add (pa, 0) to p_s;
        set (pa, n) to (pa, n + 1) in p_s;
        for each property pb in u_s:
            if pa < pb:
                if (pa, pb, _, _) not in q_s:
                    add (pa, pb, 0, _) to q_s;
                set (pa, pb, n, _) to (pa, pb, n + 1, _) in q_s;
for each tuple (pa, pb, i, _) in q_s:
    get (p, c_a) from p_s where p = pa;
    get (p, c_b) from p_s where p = pb;
    set (pa, pb, i, _) to (pa, pb, i, i / (c_a + c_b - i)) in q_s;

Lines 1 to 17 of the pseudocode are implemented by the class PropertiesProcessor.

Lines 18 to 21 of the pseudocode are implemented by the SQL query at the line 31 of the import script.
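
Without reproducing the actual import script, the kind of SQL involved in that last step can be sketched as follows; the table and column names are assumptions, not the real schema.

<?php
// Hypothetical sketch of the final step: once the per-property counts and the
// per-pair intersection counts are in MySQL, a single UPDATE can fill in the
// Jaccard index. Connection parameters and identifiers are made up.
$pdo = new PDO('mysql:host=localhost;dbname=related_properties;charset=utf8mb4', 'user', 'password');
$pdo->exec('
    UPDATE pairs q
    JOIN properties a ON a.property = q.property_a
    JOIN properties b ON b.property = q.property_b
    SET q.jaccard = q.intersection_count / (a.item_count + b.item_count - q.intersection_count)
');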

Matching BnF and Wikidata video games using Dataiku DSS

You can read this post in French: Alignement des jeux vidéo de la BnF avec Wikidata grâce à Dataiku DSS.


While the National Library of France (Bibliothèque nationale de France, BnF) collections obviously comprise numerous books, newspapers and manuscripts, they also contain items from more recent technologies, like video games. According to its SPARQL endpoint, the BnF’s general catalogue has information on over 4,000 video games. At the same time, Wikidata has information about 36,000 video games, but only 60 of them were linked to a BnF record in early February 2019!

In this blog post, we will see how to improve this linking using the software Dataiku Data Science Studio (Dataiku DSS), the objective being to correctly add as many BnF IDs as possible to video game items on Wikidata.

Dataiku DSS installation

The installation and use of Dataiku DSS are outside the scope of this post; however, here is some information that can be useful to get started with the tool.

You can download the free edition of Dataiku DSS, which is more than enough in our case, from the Dataiku website, following the instructions corresponding to your environment (for example, I use Dataiku DSS in the provided virtual machine and access it through my favorite web browser).

You need to install the SPARQL plugin (presented in this post):

Regarding the usage of Dataiku DSS, the first two tutorials offered by Dataiku should be sufficient to understand this post.

For the next steps, it is assumed that you have created a new project in Dataiku DSS.

Data from Wikidata

Data import

From Wikidata, we import the video games that have a publication date (keeping only the earliest one for each game) and do not yet have a BnF ID.

In the Flow of your project, add a new dataset of type Plugin SPARQL. Enter the URL of the SPARQL endpoint of Wikidata:

https://query.wikidata.org/sparql

Then the following SPARQL query:

SELECT ?item ?itemLabel (MIN(?year) AS ?year) {
  ?item wdt:P31 wd:Q7889 ; wdt:P577 ?date .
  BIND(YEAR(?date) AS ?year) .
  FILTER NOT EXISTS { ?item wdt:P268 [] } .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,fr" . }
}
GROUP BY ?item ?itemLabel
HAVING (?year > 1)

Retrieve the data. Feel free to play with it, for example by displaying the number of video games by year of publication.

Data preparation

Using the Dataiku DSS recipe Prepare, we prepare the retrieved data.

The main steps of the preparation are:

Data from BnF

Data import

Similarly, we import the video games from the BnF. Create a new dataset of type Plugin SPARQL, with the URL of the SPARQL endpoint of the BnF catalogue:

https://data.bnf.fr/sparql

With the following SPARQL query:

SELECT DISTINCT ?item ?itemLabel ?year
WHERE {
  ?item <http://xmlns.com/foaf/0.1/focus> ?focus ; <http://www.w3.org/2004/02/skos/core#prefLabel> ?label .
  ?focus <http://data.bnf.fr/ontology/bnf-onto/subject> "Informatique" ;
    <http://data.bnf.fr/ontology/bnf-onto/subject> "Sports et jeux" ;
    <http://data.bnf.fr/ontology/bnf-onto/firstYear> ?year .
  FILTER NOT EXISTS { ?focus <http://purl.org/dc/terms/description> "Série de jeu vidéo"@fr . } .
  FILTER NOT EXISTS { ?focus <http://purl.org/dc/terms/description> "Série de jeux vidéo"@fr . } .
  BIND(STR(?label) AS ?itemLabel) .
}

Note that, unlike in Wikidata, what constitutes a video game is only determined indirectly in the BnF records. We also need to filter out video game series with two different filters, due to inconsistencies in the BnF data.

Data preparation

Labels of video games from the BnF’s catalogue end with the string “ : jeu vidéo”. We need to remove it so the two datasets can match correctly. To do so, in the preparation recipe, we add a step replacing this string with an empty string (as for the preparation of the ID).

Then, we proceed as for Wikidata:

First filtering

We want to filter out BnF IDs that are already present in Wikidata, in order to prevent a BnF ID from being used on several Wikidata items.

To do so, we start by retrieving the BnF IDs already present in Wikidata, using a new dataset of type Plugin SPARQL, with the address of the SPARQL endpoint of Wikidata and a query that returns all the BnF IDs used in Wikidata:

SELECT DISTINCT ?bnf_id_to_filter WHERE { [] wdt:P268 ?bnf_id_to_filter . }

Then, we join the two datasets bnf_video_games_prepared and bnf_ids using the recipe Join with. The default engine of Dataiku DSS cannot directly do the kind of join that would filter out, in one go, the lines we want to remove. The trick is to do a left outer join: we keep all the lines from the first dataset and, when a match exists, retrieve information from the second dataset (here, we only retrieve its identifier):

Then, in the Post-filter part of the recipe, we keep only the lines from the first dataset for which no match was found, i.e. those for which the column bnf_id_to_filter is empty after the join.
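
Outside Dataiku DSS, the same “left join, then keep the unmatched lines” logic can be sketched in a few lines; the variable names below are hypothetical, the real work being done by the visual recipe.

<?php
// $bnfVideoGames: rows prepared from the BnF catalogue, each with a 'bnf_id' key.
// $bnfIdsInWikidata: flat list of BnF IDs returned by the wdt:P268 query above.
$alreadyUsed = array_flip($bnfIdsInWikidata);
$filtered = array_filter($bnfVideoGames, function (array $row) use ($alreadyUsed): bool {
    // Keep the line only when no match was found in Wikidata.
    return !isset($alreadyUsed[$row['bnf_id']]);
});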

The result of the recipe is the set of video games from the BnF, retrieved and prepared, minus those whose ID is already present in Wikidata.

Data matching

Using the recipe Join with, we combine the two prepared datasets into a single one. We use an inner join with two join conditions: the normalized label of the game and its year of publication.

The normalization of the labels allows us to match labels that only have small differences in their format (capital letters, accents, spaces, etc.) between the BnF and Wikidata.

The year of publication makes it possible to distinguish video games with identical titles but released several years apart. For example, there are two games called Doom: one published in 1993 and the other one in 2016.
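
To give a rough idea of what the label normalization amounts to, here is an equivalent function; the exact processors used in the Prepare recipe may differ, the function name is made up, and it requires the intl extension.

<?php
// Hypothetical equivalent of the label normalization done in the Prepare recipe.
function normalizeLabel(string $label): string
{
    // Strip accents: "Pokémon" -> "Pokemon" (requires the intl extension).
    $label = transliterator_transliterate('Any-Latin; Latin-ASCII', $label);
    // Lowercase, then keep only letters and digits, collapsing the rest into single spaces.
    $label = mb_strtolower($label, 'UTF-8');
    return trim(preg_replace('/[^a-z0-9]+/', ' ', $label));
}

echo normalizeLabel('Baldur’s Gate II : Shadows of Amn'); // "baldur s gate ii shadows of amn"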

Data cleaning

When you look at the data, you can see that there are duplicates. For example, there are several Hook video games released in 1992, with 3 IDs in Wikidata and 2 in the BnF. The BnF also seems to have 2 records for the same game, Parasite Eve, released in 1998; after verification, it appears that one of them is actually the series, which is not categorized as such in the BnF catalogue. Rather than inserting incorrect data, we filter these cases out.

We start by building two datasets, each containing the games whose ID (from the BnF and from Wikidata, respectively) appears exactly once after the matching. To do so, we use the recipe Group, grouping the video games by ID.

In the Post-filter part of each Group recipe, we keep only the IDs that appear exactly once.

Then, we do two successive inner joins from the combined data (video_games_joined) with the two datasets we just created (video_games_grouped_by_bnf_id and video_games_grouped_by_wd_id). This guarantees that the remaining lines no longer contain duplicate identifiers.
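
The deduplication logic boils down to keeping only the matches whose identifiers are unique on both sides, as in this sketch (hypothetical variable names again):

<?php
// $matches: result of the inner join, each row with 'bnf_id' and 'wd_id' keys.
$countsByBnfId = array_count_values(array_column($matches, 'bnf_id'));
$countsByWdId  = array_count_values(array_column($matches, 'wd_id'));
$unambiguous = array_filter($matches, function (array $row) use ($countsByBnfId, $countsByWdId): bool {
    // Keep a match only if both its BnF ID and its Wikidata ID appear exactly once.
    return $countsByBnfId[$row['bnf_id']] === 1 && $countsByWdId[$row['wd_id']] === 1;
});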

Data import into Wikidata

After the cleaning, we want to import the data into Wikidata. For that, we put it in the CSV format expected by QuickStatements:
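
Roughly speaking, the file needs a qid column and a P268 column. The lines below are only an illustration with made-up Q-ids and BnF IDs; check the QuickStatements documentation for the exact quoting of string values.

qid,P268
Q11111111,"""12345678x"""
Q22222222,"""23456789k"""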

Dataiku DSS allows you to export data in CSV format. The last step is to copy-paste the content of the exported file into QuickStatements, which gives us:

Outcomes

In a few clicks, we imported more than 2,000 BnF IDs of video games into Wikidata. However, the work is not finished!