Exploring Wikidata properties by the similarity of their use
A few weeks ago, I released Wikidata Related Properties, a tool to explore Wikidata properties and find the ones used together.
The first idea when you want to find which properties are most used with a given property in Wikidata is to look at the cardinality of the intersection, i.e. the number of items that use both properties. The issue with this method is that it mainly returns general properties. For instance, when you look at the properties closest to archives at (P485), sorted by the cardinality of the intersection, you get a bunch of general properties about humans (sex or gender, occupation, given name, …).
Another idea is to use the Jaccard index, which is the cardinality of the intersection of two sets divided by the cardinality of their union. It helps find properties that are mainly used together rather than on differing sets of items. With the same example of archives at (P485), the closest properties sorted by the Jaccard index are quite different: they are mostly external IDs from authorities.
In a nutshell:
- the sort by cardinality of intersection allows you to find general properties;
- the sort by Jaccard index allows you to find more domain-specific properties.
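To make the difference between the two sorts concrete, here is a minimal Python sketch. The item sets are made-up toy data, not real Wikidata statistics, and "authority ID" is a hypothetical stand-in for a domain-specific external ID.

```python
def intersection_size(a: set, b: set) -> int:
    """Number of items that use both properties."""
    return len(a & b)


def jaccard(a: set, b: set) -> float:
    """Cardinality of the intersection divided by the cardinality of the union."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


# Toy data (assumption): the set of item ids that use each property.
items_by_property = {
    "archives at": {1, 2, 3, 4},
    "sex or gender": set(range(1, 101)),  # general property, used on 100 items
    "authority ID": {1, 2, 3, 5},         # hypothetical domain-specific property
}


def closest(target: str, score) -> list:
    """Other properties, sorted from the closest to the farthest by `score`."""
    target_items = items_by_property[target]
    others = [p for p in items_by_property if p != target]
    return sorted(
        others,
        key=lambda p: score(target_items, items_by_property[p]),
        reverse=True,
    )


by_intersection = closest("archives at", intersection_size)
by_jaccard = closest("archives at", jaccard)
```

With this toy data, the general property comes first when sorting by intersection (4 shared items against 3), while the domain-specific one comes first when sorting by Jaccard index (3/5 against 4/100).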
The tool shows the closest properties by both methods. Each property is displayed with its English label and its P number, and is linked to its page on Wikidata. Properties can be filtered by type, for example to gather statistics about external IDs only. The data can be downloaded from the main page of the tool.
At the moment, statistics are limited to:
- properties used as main properties of statements (not as qualifiers or in references);
- the main (Q) and property (P) namespaces; lexicographical data is not included, as lexemes are excluded from Wikidata JSON dumps for an unknown reason (T195419, T220883).
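As an illustration of the first restriction, here is a rough Python sketch of how the main properties of an entity could be collected from its JSON representation. It is not the code of the tool (which uses the Wikidata Toolkit); the function name and the heavily trimmed sample entity are assumptions made for the example. The rank filter follows the pseudocode given later in this post.

```python
def main_properties(entity: dict) -> set:
    """Ids of the properties used as main property of at least one statement
    with normal or preferred rank. Qualifiers and references are ignored,
    because only the keys of "claims" are read."""
    used = set()
    for property_id, statements in entity.get("claims", {}).items():
        if any(s.get("rank") in ("normal", "preferred") for s in statements):
            used.add(property_id)
    return used


# Hypothetical, heavily trimmed entity in the shape of the Wikidata JSON format.
sample_entity = {
    "id": "Q1",
    "claims": {
        "P31": [
            {
                "mainsnak": {"property": "P31"},
                "rank": "normal",
                # P580 is only a qualifier here, so it must not be counted.
                "qualifiers": {"P580": [{"property": "P580"}]},
            }
        ],
        # The only P485 statement is deprecated, so P485 is skipped.
        "P485": [{"mainsnak": {"property": "P485"}, "rank": "deprecated"}],
    },
}

used = main_properties(sample_entity)
```

Here `used` only contains P31: P580 appears as a qualifier, and the single P485 statement is deprecated.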
Note: the idea to use the Jaccard index comes from Goran S. Milovanović (T214897).
The tool relies on the weekly Wikidata JSON dump, which is read in a single pass with the Wikidata Toolkit to compute the cardinality of each property and the cardinality of the intersection of each pair of properties. The data is then imported into a MySQL database to compute the Jaccard index and to easily display the data with PHP.
Here is a description of the algorithm and its main variables used to generate the statistics:
- p_s is the list of all Wikidata properties; each element is a pair (p, c), with p the id of the property and c the cardinality of the property (i.e. the number of distinct Wikidata items that use it).
- q_s is the list of all pairs of Wikidata properties; each element is a 4-tuple (pa, pb, i, j), with pa and pb the ids of the properties, i the cardinality of intersection (i.e. the number of distinct Wikidata items that use both properties), and j the Jaccard index (i.e. the number of distinct Wikidata items that use both properties divided by the number of distinct Wikidata items that use at least one of the properties).
- u_s is the list of properties used in a Wikidata item.
set p_s to an empty set of pairs;
set q_s to an empty set of 4-tuples;
for each item in the Wikidata JSON dump:
    set u_s to an empty set of singletons;
    for each statement with normal or preferred rank in the item:
        set p to the main property used in the statement;
        if p not in u_s:
            add p to u_s;
    for each property pa in u_s:
        if (pa, _) not in p_s:
            add (pa, 0) to p_s;
        set (pa, n) to (pa, n + 1) in p_s;
        for each property pb in u_s:
            if pa < pb:
                if (pa, pb, _, _) not in q_s:
                    add (pa, pb, 0, _) to q_s;
                set (pa, pb, n, _) to (pa, pb, n + 1, _) in q_s;
for each tuple (pa, pb, i, _) in q_s:
    get (p, c_a) from p_s where p = pa;
    get (p, c_b) from p_s where p = pb;
    set (pa, pb, i, _) to (pa, pb, i, i / (c_a + c_b - i)) in q_s;
Lines 1 to 17 of the pseudocode are implemented by the class PropertiesProcessor.
Lines 18 to 21 of the pseudocode are implemented by the SQL query on line 31 of the import script.
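For readers who prefer runnable code, the pseudocode can be approximated in a few lines of Python. This is only a sketch under assumptions: it is neither the PropertiesProcessor class nor the SQL query, it takes as input an iterable giving the property ids used by each item (instead of reading the dump), and the function name and the sample data are made up.

```python
from collections import Counter
from itertools import combinations


def related_property_stats(items):
    """Return (cardinality, pairs), where `pairs` maps each (pa, pb) with
    pa < pb to (cardinality of intersection, Jaccard index)."""
    cardinality = Counter()   # plays the role of p_s
    intersection = Counter()  # plays the role of the i values in q_s
    for used in items:
        u_s = set(used)  # each property is counted once per item
        cardinality.update(u_s)
        # Sorting first guarantees pa < pb (as strings), so each unordered
        # pair is counted under a single key.
        intersection.update(combinations(sorted(u_s), 2))
    pairs = {
        (pa, pb): (i, i / (cardinality[pa] + cardinality[pb] - i))
        for (pa, pb), i in intersection.items()
    }
    return cardinality, pairs


# Made-up sample: the properties used by four hypothetical items.
sample_items = [
    ["P31", "P21", "P485"],
    ["P31", "P21"],
    ["P31"],
    ["P485", "P214"],
]
cardinality, pairs = related_property_stats(sample_items)
```

In this sample, P31 and P485 share one item out of the four that use at least one of them, so their Jaccard index is 1 / (3 + 2 - 1) = 0.25. Note that keeping every pair in memory is fine for a toy example, but the number of pairs grows quadratically with the number of properties.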