An overview of R libraries to query Wikidata

by Envel Le Hir — CC BY-SA 4.0 — January 27, 2019 — #rstat, Mediawiki, R, SPARQL, Wikidata

R is a free programming language for statistical computing. This post is an overview of R libraries that let you query Wikidata and fetch data from it.

Libraries
Examples
Summary

Libraries

WikidataR 1.4.0

Note: WikidataR has been forked and is actively maintained; it is now available in version 2.1.3. That version is not covered here yet, though it is worth noting that it bundles several packages, including WikidataQueryServiceR, described below.

WikidataR is the only R library that targets the Wikidata part of the MediaWiki API. It is basic (not all features of the API are covered), but it works well and is neatly documented. You can get a specific item or property with get_item and get_property, get random items or properties with get_random_item and get_random_property (an optional parameter fetches several elements at once), and find items and properties by their labels and aliases with find_item and find_property. WikidataR fully supports the Wikibase data model for items and properties, but it is out of date: lexicographical data is not implemented yet.

WikidataQueryServiceR 1.0.0

WikidataQueryServiceR is an R library that targets the Wikidata Query Service (WDQS), the official SPARQL endpoint of Wikidata. It provides a simple function, query_wikidata, that returns the results of a query.

Since version 1.0.0, this library also provides a function named get_example that fetches queries from the examples page. It lets you scrape examples from that page and then run them with query_wikidata.
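As a hedged sketch of that workflow: the snippet below assumes WikidataQueryServiceR 1.0.0 is installed, that WDQS is reachable, and that an example titled "Cats" still exists on the examples page (the page is edited by the community, so the title is an assumption and may change).

```r
library(WikidataQueryServiceR)

# Assumption: "Cats" is the title of a query on the WDQS examples page.
cats_query <- get_example("Cats")

# The example comes back as SPARQL text, ready to be passed to query_wikidata().
cats <- query_wikidata(cats_query)
head(cats)
```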

SPARQL package 1.16

The SPARQL package is a generic R library that lets you query any SPARQL endpoint. Its advantage over WikidataQueryServiceR is that you can use it to query several SPARQL endpoints, not only Wikidata’s. Surprisingly, it is much slower than WikidataQueryServiceR, even when your query returns only a few thousand results.

Miscellaneous

This post only covers general-purpose libraries (I may have missed some!). More specific libraries that allow you to retrieve data from Wikidata exist, like wikitaxa for taxonomic data, or webchem for chemical data.

Examples

WikidataR 1.4.0

As the documentation of WikidataR is short and covers nearly everything, I’ll let you read it!
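For a first taste before opening the documentation, here is a minimal sketch of the functions listed in the Libraries section. It assumes WikidataR 1.4.0 is installed and that the Wikidata API is reachable; the identifiers Q2374 (Civilization III) and P577 (publication date) are the ones used in the SPARQL examples of this post, and the search term is only an illustration.

```r
library(WikidataR)

# Fetch one item and one property by their identifiers.
item <- get_item(id = "Q2374")       # Civilization III
prop <- get_property(id = "P577")    # publication date

# Search items by label or alias; language and limit are optional.
hits <- find_item("Civilization", language = "en", limit = 5)

# Fetch three random items in a single call.
random_items <- get_random_item(limit = 3)

# Each result is a list of entities; the English label of the first one
# should be reachable like this (structure assumed from the API's JSON).
item[[1]]$labels$en$value
```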

WikidataQueryServiceR 1.0.0

Install the library and load it:

> install.packages("WikidataQueryServiceR")
> library(WikidataQueryServiceR)

Query all video games with a publication date, keeping only the earliest date per game (a video game can have several publication dates, depending on the platform or the region of release):

> r <- query_wikidata('
    SELECT ?item ?itemLabel (MIN(?_date) AS ?date) (MIN(?_year) AS ?year) {
        ?item wdt:P31 wd:Q7889 ; wdt:P577 ?_date .
        BIND(YEAR(?_date) AS ?_year) .
        SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
    }
    GROUP BY ?item ?itemLabel
    HAVING (?year > 1)
')
25817 rows were returned by WDQS

Display the first ones:

> head(r)
                                  item                      itemLabel                 date year
1 http://www.wikidata.org/entity/Q2374               Civilization III 2001-10-30T00:00:00Z 2001
2 http://www.wikidata.org/entity/Q2377                Civilization IV 2005-10-25T00:00:00Z 2005
3 http://www.wikidata.org/entity/Q2385                 Civilization V 2010-09-21T00:00:00Z 2010
4 http://www.wikidata.org/entity/Q2387    Commandos 2: Men of Courage 2001-09-20T00:00:00Z 2001
5 http://www.wikidata.org/entity/Q2440 Freedom Force vs the 3rd Reich 2005-03-08T00:00:00Z 2005
6 http://www.wikidata.org/entity/Q2450    Heroes of Might and Magic V 2006-05-16T00:00:00Z 2006

Display the number of games published each year:

> barplot(table(r$year), col = "dodgerblue3", xlab = "year", ylab = "count")
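The item and date columns shown above are plain strings. If you prefer bare identifiers and proper Date values, a small base R sketch follows. It works on a hand-made sample (r_sample, a hypothetical name) copied from the first two rows of the output, so it runs without querying WDQS; apply the same two lines to r on real results.

```r
# Hand-made sample with the same columns as the data frame returned above.
r_sample <- data.frame(
  item = c("http://www.wikidata.org/entity/Q2374",
           "http://www.wikidata.org/entity/Q2377"),
  itemLabel = c("Civilization III", "Civilization IV"),
  date = c("2001-10-30T00:00:00Z", "2005-10-25T00:00:00Z"),
  year = c(2001, 2005),
  stringsAsFactors = FALSE
)

# Keep only the identifier that follows "/entity/" in the item URL.
r_sample$qid <- sub("^.*/entity/", "", r_sample$item)

# Keep the YYYY-MM-DD part of the ISO 8601 string and convert it to a Date.
r_sample$date <- as.Date(substr(r_sample$date, 1, 10))

r_sample[, c("qid", "itemLabel", "date")]
```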

SPARQL package 1.16

Install the library and load it:

install.packages("SPARQL")
library(SPARQL)

We use the same query as in the previous example and, as this library can query any SPARQL endpoint, we have to give it the URL of the Wikidata endpoint:

r <- SPARQL('https://query.wikidata.org/sparql','
    SELECT ?item ?itemLabel (MIN(?_date) AS ?date) (MIN(?_year) AS ?year) {
        ?item wdt:P31 wd:Q7889 ; wdt:P577 ?_date .
        BIND(YEAR(?_date) AS ?_year) .
        SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
    }
    GROUP BY ?item ?itemLabel
    HAVING (?year > 1)
',curl_args=list(useragent='User Agent Example'))

Display the first results (note that they are in r$results):

> head(r$results)
                                    item                           itemLabel       date year
1 <http://www.wikidata.org/entity/Q2374>               "Civilization III"@en 1004396400 2001
2 <http://www.wikidata.org/entity/Q2377>                "Civilization IV"@en 1130191200 2005
3 <http://www.wikidata.org/entity/Q2385>                 "Civilization V"@en 1285020000 2010
4 <http://www.wikidata.org/entity/Q2387>    "Commandos 2: Men of Courage"@en 1000936800 2001
5 <http://www.wikidata.org/entity/Q2440> "Freedom Force vs the 3rd Reich"@en 1110236400 2005
6 <http://www.wikidata.org/entity/Q2450>    "Heroes of Might and Magic V"@en 1147730400 2006

You may notice that, by default:

URLs are surrounded by brackets;
labels contain language codes;
dates are returned as Unix timestamps (seconds since 1970-01-01).

Still, you can display the same graph:

> barplot(table(r$results$year), col = "dodgerblue3", xlab = "year", ylab = "count")
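The three defaults noted above can be undone with a few lines of base R. This is only a sketch on a hand-made sample (results_sample, a hypothetical name) copied from the first two rows of the output, so it runs offline. One caveat: the timestamps in that output fall one or two hours before midnight UTC, which suggests the dates were parsed in the local time zone of the machine that ran the query, so the snippet keeps full date-times in UTC instead of guessing a calendar date.

```r
# Hand-made sample with the same shape as r$results shown above.
results_sample <- data.frame(
  item = c("<http://www.wikidata.org/entity/Q2374>",
           "<http://www.wikidata.org/entity/Q2377>"),
  itemLabel = c('"Civilization III"@en', '"Civilization IV"@en'),
  date = c(1004396400, 1130191200),
  year = c(2001, 2005),
  stringsAsFactors = FALSE
)

# Remove the angle brackets around URLs.
results_sample$item <- gsub("^<|>$", "", results_sample$item)

# Remove the surrounding quotes and the trailing language tag from labels.
results_sample$itemLabel <- sub('^"(.*)"@[A-Za-z-]+$', "\\1",
                                results_sample$itemLabel)

# Convert Unix timestamps (seconds since 1970-01-01 UTC) to date-times.
results_sample$date <- as.POSIXct(results_sample$date,
                                  origin = "1970-01-01", tz = "UTC")

results_sample
```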

Summary

	WikidataR	WikidataQueryServiceR	SPARQL package
CRAN	WikidataR	WikidataQueryServiceR	SPARQL
Repository	github.com/Ironholds/WikidataR	github.com/bearloga/WikidataQueryServiceR	github.com/cran/SPARQL
Version	1.4.0	1.0.0	1.16
Release date	2017-09-22	2020-06-17	2013-10-25
Target	MediaWiki API	Wikidata Query Service	any SPARQL endpoint
Features	`get_item(id)`, `get_property(id)`, `get_random_item([limit])`, `get_random_property([limit])`, `find_item(term[,language=][,limit=])`, `find_property(term[,language=][,limit=])`	`query_wikidata(query)`*	`SPARQL(endpoint,query)`*
Pros	simple and effective; full support of the Wikibase data model (but not up to date)	simple and effective; has the power of SPARQL queries	has the power of SPARQL queries; one library to query any SPARQL endpoint
Cons	no support of lexemes		slow

(*) Both WikidataQueryServiceR and the SPARQL package have options that change the output format; they are not covered here.

So, what library should you use? It depends on your needs:

if you only need to get a few specific items from Wikidata, use WikidataR;
if you need to do heavier work, such as complex searches that return many results, use WikidataQueryServiceR;
if you need to query several SPARQL endpoints, use the SPARQL package.

Update February 2022: this post has been updated to reflect new releases of WikidataR and WikidataQueryServiceR.

R logo by The R Foundation, CC BY-SA 4.0.
