Exploring Wikidata properties by the similarity of their use

A few weeks ago, I released Related Properties, a tool to explore Wikidata properties and find the ones used together.

Features overview

The first idea for finding which properties are most used alongside a given property in Wikidata is to look at the cardinality of the intersection, i.e. the number of items that use both properties. The issue with this method is that it mainly returns general properties. For instance, when you look at the properties closest to archives at (P485) sorted by the cardinality of the intersection, you get a bunch of general properties about humans (sex or gender, occupation, given name, …).

Another idea is to use the Jaccard index, which is the cardinality of the intersection of two sets divided by the cardinality of their union. It makes it possible to find properties that are mainly used together rather than on differing sets of items. With the same example of archives at (P485), the closest properties sorted by the Jaccard index are quite different: they are mostly external IDs from authorities.
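As a small illustration of the difference between the two rankings, here is a minimal Python sketch. The item sets and the names specific, companion and general are made up for the example; they are not real Wikidata data.

```python
def jaccard(a: set, b: set) -> float:
    """Cardinality of the intersection divided by the cardinality of the union."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical sets of item identifiers, one set per property.
specific = {1, 2, 3, 4}        # items using the property we start from
companion = {2, 3, 4, 5}       # a property used on almost the same items
general = set(range(1, 101))   # a very general property used on many items

# Ranking by cardinality of intersection favours the general property...
by_intersection = {
    "companion": len(specific & companion),
    "general": len(specific & general),
}

# ...while ranking by Jaccard index favours the companion property.
by_jaccard = {
    "companion": jaccard(specific, companion),
    "general": jaccard(specific, general),
}
```

With these toy sets, the general property shares more items with the starting property, but its Jaccard index is much lower because most of its items do not use the starting property at all.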

In a nutshell:

The tool shows the closest properties by both methods. Each property is displayed with its English label and its P number, and is linked to its page on Wikidata. Properties can be filtered by type, for example to gather statistics about external IDs only. The data can be downloaded from the main page of the tool.

Limits

At the moment, statistics are limited to:

Other methods to detect similarity should be available. For instance, the fact that P4285 is (or should be) a subset of P269 is not clearly visible at the moment.
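To see why a subset relation is hard to spot with the Jaccard index alone, consider the following Python sketch. The containment measure used here (intersection divided by the cardinality of the smaller property) is only one possible option and is not something the tool currently provides; the sets are made up and do not reflect the real usage of P4285 or P269.

```python
def jaccard(a: set, b: set) -> float:
    """Cardinality of the intersection divided by the cardinality of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(a: set, b: set) -> float:
    """Share of the items of a that also belong to b (1.0 means a is a subset of b)."""
    return len(a & b) / len(a) if a else 0.0


# Hypothetical data: a small property used only on items
# that also carry a much larger property.
small = set(range(10))
large = set(range(1000))

jaccard_value = jaccard(small, large)          # low, despite the subset relation
containment_value = containment(small, large)  # 1.0, the subset relation is visible
```

In this toy case the Jaccard index is low because the larger property is used on many items that lack the smaller one, whereas the containment value directly shows that every item of the smaller set is in the larger one.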

Note: the idea to use the Jaccard index comes from Goran S. Milovanović (T214897).

Technical overview

The tool relies on the weekly Wikidata JSON dump, which is read in a single pass with Wikidata Toolkit to compute the cardinality of each property and the cardinality of the intersection of each pair of properties. The data is then imported into a MySQL database to compute the Jaccard index and to easily display the data with PHP.

Here is a description of the algorithm used to generate the statistics, with its main variables:

set p_s to an empty set of pairs;
set q_s to an empty set of 4-tuples;
for each item in the Wikidata JSON dump:
    set u_s to an empty set of singletons;
    for each statement with normal or preferred rank in the item:
        set p to the main property used in the statement;
        if p not in u_s:
            add p to u_s;
    for each property pa in u_s:
        if (pa, _) not in p_s:
            add (pa, 0) to p_s;
        set (pa, n) to (pa, n + 1) in p_s;
        for each property pb in u_s:
            if pa < pb:
                if (pa, pb, _, _) not in q_s:
                    add (pa, pb, 0, _) to q_s;
                set (pa, pb, n, _) to (pa, pb, n + 1, _) in q_s;
for each tuple (pa, pb, i, _) in q_s:
    get (p, c_a) from p_s where p = pa;
    get (p, c_b) from p_s where p = pb;
    set (pa, pb, i, _) to (pa, pb, i, i / (c_a + c_b - i)) in q_s;

Lines 1 to 17 of the pseudocode are implemented by the class PropertiesProcessor.

Lines 18 to 21 of the pseudocode are implemented by the SQL query at line 31 of the import script.
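For readers who prefer runnable code, the pseudocode above can be sketched in Python as follows. This is not the actual implementation (which uses Wikidata Toolkit and SQL): the input format, a list of (property, rank) pairs per item, and the function names count_properties and jaccard_indexes are assumptions made for the example, as are the property identifiers Pa, Pb and Pc.

```python
from collections import Counter
from itertools import combinations


def count_properties(items):
    """Single pass over the items (lines 1 to 17 of the pseudocode)."""
    property_counts = Counter()  # plays the role of p_s
    pair_counts = Counter()      # plays the role of q_s, without the Jaccard index
    for statements in items:
        # Distinct properties used with normal or preferred rank (u_s).
        used = {prop for prop, rank in statements
                if rank in ("normal", "preferred")}
        property_counts.update(used)
        # Sorting guarantees pa < pb, so each pair is counted once per item.
        pair_counts.update(combinations(sorted(used), 2))
    return property_counts, pair_counts


def jaccard_indexes(property_counts, pair_counts):
    """Jaccard index of each pair (lines 18 to 21 of the pseudocode)."""
    return {
        (pa, pb): i / (property_counts[pa] + property_counts[pb] - i)
        for (pa, pb), i in pair_counts.items()
    }


# Made-up items; each one is a list of (property, rank) statements.
items = [
    [("Pa", "normal"), ("Pb", "normal"), ("Pc", "preferred")],
    [("Pa", "normal"), ("Pb", "normal"), ("Pb", "deprecated")],
    [("Pa", "normal"), ("Pc", "deprecated")],
]

property_counts, pair_counts = count_properties(items)
indexes = jaccard_indexes(property_counts, pair_counts)
```

In this sketch, a property used several times in the same item is counted once for that item, and statements with deprecated rank are ignored, as in the pseudocode.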