An overview of R libraries to query Wikidata

R logo

R is a free programming language for statistical computing. This post gives an overview of R libraries that let you query Wikidata and fetch data from it.

Libraries

WikidataR 1.4.0

Note: WikidataR has been forked and is actively maintained; the fork is now available as version 2.1.3. That version is not reviewed here yet, though it is worth noting that it bundles several packages, including WikidataQueryServiceR, described below.

WikidataR is the only R library that targets the Wikidata part of the MediaWiki API. While it is basic (not all features of the API are covered), it works well and is neatly documented. You can get a specific item or property with get_item and get_property, get random items or properties with get_random_item and get_random_property (with an optional parameter to fetch several elements at once), and find items and properties by their labels and aliases using find_item and find_property. WikidataR fully supports the Wikibase data model for items and properties, but it is out of date: lexicographical data (lexemes) is not implemented yet.

WikidataQueryServiceR 1.0.0

WikidataQueryServiceR is an R library that targets the Wikidata Query Service (WDQS), the official SPARQL endpoint of Wikidata. It provides a simple function, query_wikidata, that returns the results of a query.

Since version 1.0.0, this library also provides a function named get_example that fetches queries from the examples page. It lets you scrape examples and then run them with query_wikidata.
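As a rough sketch of how the two functions can be chained: the helper name run_example is hypothetical, and passing the example title as the first argument of get_example is an assumption to check against the package documentation. Running it needs network access.

```r
# Hypothetical helper: fetch a named query from the WDQS examples page and
# run it. Assumes WikidataQueryServiceR >= 1.0.0 is installed; the package
# functions are only called when run_example() itself is called.
run_example <- function(example_name) {
  # get_example() scrapes the query text for the given example title
  sparql_query <- WikidataQueryServiceR::get_example(example_name)
  # query_wikidata() sends the query to WDQS and returns the results
  WikidataQueryServiceR::query_wikidata(sparql_query)
}
```

Calling run_example with the title of an entry on the examples page (for instance "Cats", if such an entry exists) should return the result table for that query.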

SPARQL package 1.16

The SPARQL package is a generic R library that lets you query any SPARQL endpoint. Its advantage over WikidataQueryServiceR is that you can use it to query several SPARQL endpoints, not only Wikidata's. Surprisingly, this library is much slower than WikidataQueryServiceR, even when your query returns only a few thousand results.
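If you want to check the speed difference on your own queries, a small timing helper is enough. This is only a sketch: elapsed_seconds is a hypothetical name, and the commented usage assumes both packages are installed and that a variable named query holds a SPARQL query string.

```r
# Hypothetical helper: wall-clock seconds taken by a zero-argument function.
elapsed_seconds <- function(run) {
  system.time(run())[["elapsed"]]
}

# Hypothetical usage (needs both packages and network access):
# elapsed_seconds(function() WikidataQueryServiceR::query_wikidata(query))
# elapsed_seconds(function() SPARQL::SPARQL(
#   "https://query.wikidata.org/sparql", query,
#   curl_args = list(useragent = "User Agent Example")))
```

A single run is noisy, since it also measures network latency and server load, so repeating each call a few times gives a fairer picture.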

Miscellaneous

This post only covers general-purpose libraries (I may have missed some!). More specific libraries that allow you to retrieve data from Wikidata exist, like wikitaxa for taxonomic data, or webchem for chemical data.

Examples

WikidataR 1.4.0

As the documentation of WikidataR is short and covers nearly everything, I’ll let you read it!
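For a quick taste before opening the documentation, here is a minimal sketch that combines find_item and get_item. The helper name first_match_label is hypothetical, and the fields accessed on the returned objects (id, labels, value) are assumptions about the structure WikidataR 1.4.0 returns; inspect your own results with str() to confirm them.

```r
# Hypothetical helper built on WikidataR; needs network access when called.
# Searches items by label or alias, fetches the first hit, and returns its
# label in the requested language (NA if the search finds nothing).
first_match_label <- function(term, language = "en") {
  hits <- WikidataR::find_item(term, language = language, limit = 5)
  if (length(hits) == 0) {
    return(NA_character_)
  }
  # Assumption: each search hit exposes its identifier as $id
  item <- WikidataR::get_item(hits[[1]]$id)
  # Assumption: labels are stored per language code, each with a $value
  item[[1]]$labels[[language]]$value
}
```

Calling first_match_label("video game") would then search Wikidata and return the label of the first matching item.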

WikidataQueryServiceR 1.0.0

Install the library and load it:

> install.packages("WikidataQueryServiceR")
> library(WikidataQueryServiceR)

Query all video games with a publication date, keeping only the earliest date per video game (a video game can have several publication dates, depending on the platform or the region of release):

> r <- query_wikidata('
    SELECT ?item ?itemLabel (MIN(?_date) AS ?date) (MIN(?_year) AS ?year) {
        ?item wdt:P31 wd:Q7889 ; wdt:P577 ?_date .
        BIND(YEAR(?_date) AS ?_year) .
        SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
    }
    GROUP BY ?item ?itemLabel
    HAVING (?year > 1)
')
25817 rows were returned by WDQS

Display the first ones:

> head(r)
                                  item                      itemLabel                 date year
1 http://www.wikidata.org/entity/Q2374               Civilization III 2001-10-30T00:00:00Z 2001
2 http://www.wikidata.org/entity/Q2377                Civilization IV 2005-10-25T00:00:00Z 2005
3 http://www.wikidata.org/entity/Q2385                 Civilization V 2010-09-21T00:00:00Z 2010
4 http://www.wikidata.org/entity/Q2387    Commandos 2: Men of Courage 2001-09-20T00:00:00Z 2001
5 http://www.wikidata.org/entity/Q2440 Freedom Force vs the 3rd Reich 2005-03-08T00:00:00Z 2005
6 http://www.wikidata.org/entity/Q2450    Heroes of Might and Magic V 2006-05-16T00:00:00Z 2006

Display the number of games published each year:

> barplot(table(r$year), col = "dodgerblue3", xlab = "year", ylab = "count")
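The columns returned above are often easier to work with after a little post-processing. The following base R sketch assumes the date column comes back as character strings (as the printed output suggests) and works on a hand-built two-row sample rather than the live result; r_sample and qid are illustrative names.

```r
# Two rows copied from the output above, used as a stand-in for `r`.
r_sample <- data.frame(
  item = c("http://www.wikidata.org/entity/Q2374",
           "http://www.wikidata.org/entity/Q2377"),
  itemLabel = c("Civilization III", "Civilization IV"),
  date = c("2001-10-30T00:00:00Z", "2005-10-25T00:00:00Z"),
  year = c(2001, 2005),
  stringsAsFactors = FALSE
)

# Keep only the Q-identifier (everything after the last slash of the URI).
r_sample$qid <- sub("^.*/", "", r_sample$item)

# Parse the leading YYYY-MM-DD part of the timestamp into a Date.
r_sample$date <- as.Date(substr(r_sample$date, 1, 10))
```

The same two transformations can be applied to the full result by replacing r_sample with r.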

SPARQL package 1.16

Install the library and load it:

install.packages("SPARQL")
library(SPARQL)

We use the same query as in the previous example and, as this library can query any SPARQL endpoint, we have to give it the URL of the Wikidata endpoint:

r <- SPARQL('https://query.wikidata.org/sparql','
    SELECT ?item ?itemLabel (MIN(?_date) AS ?date) (MIN(?_year) AS ?year) {
        ?item wdt:P31 wd:Q7889 ; wdt:P577 ?_date .
        BIND(YEAR(?_date) AS ?_year) .
        SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
    }
    GROUP BY ?item ?itemLabel
    HAVING (?year > 1)
',curl_args=list(useragent='User Agent Example'))

Display the first results (note that they are in r$results):

> head(r$results)
                                    item                           itemLabel       date year
1 <http://www.wikidata.org/entity/Q2374>               "Civilization III"@en 1004396400 2001
2 <http://www.wikidata.org/entity/Q2377>                "Civilization IV"@en 1130191200 2005
3 <http://www.wikidata.org/entity/Q2385>                 "Civilization V"@en 1285020000 2010
4 <http://www.wikidata.org/entity/Q2387>    "Commandos 2: Men of Courage"@en 1000936800 2001
5 <http://www.wikidata.org/entity/Q2440> "Freedom Force vs the 3rd Reich"@en 1110236400 2005
6 <http://www.wikidata.org/entity/Q2450>    "Heroes of Might and Magic V"@en 1147730400 2006

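You may notice that, by default:

  • item URIs are wrapped in angle brackets;
  • labels are wrapped in double quotes and carry their language tag (@en);
  • dates are returned as numeric Unix timestamps rather than date strings.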

Still, you can display the same graph:

> barplot(table(r$results$year), col = "dodgerblue3", xlab = "year", ylab = "count")
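The raw values returned by the SPARQL package (angle-bracketed URIs, quoted labels with language tags, numeric timestamps) can be normalised with base R. This is a sketch on a hand-built sample, with res as an illustrative name. The timestamps in the output above fall one or two hours before midnight UTC, which suggests they encode midnight in the local time zone of the machine that ran the query; the sketch assumes Europe/Paris, a guess that fits the sample values and that you should replace with your own time zone.

```r
# Two rows copied from the output above, used as a stand-in for r$results.
res <- data.frame(
  item = c("<http://www.wikidata.org/entity/Q2374>",
           "<http://www.wikidata.org/entity/Q2377>"),
  itemLabel = c('"Civilization III"@en', '"Civilization IV"@en'),
  date = c(1004396400, 1130191200),
  year = c(2001, 2005),
  stringsAsFactors = FALSE
)

# Strip the surrounding angle brackets from the URIs.
res$item <- gsub("^<|>$", "", res$item)

# Strip the surrounding quotes and the trailing language tag from the labels.
res$itemLabel <- sub('^"(.*)"@[A-Za-z-]+$', "\\1", res$itemLabel)

# Convert seconds since 1970-01-01 to a Date.
# Assumption: the timestamps are midnight in the Europe/Paris time zone.
local_tz <- "Europe/Paris"
res$date <- as.Date(
  as.POSIXct(res$date, origin = "1970-01-01", tz = local_tz),
  tz = local_tz
)
```

After these steps the sample has the same shape as the WikidataQueryServiceR output, apart from the date column being a Date rather than a string.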

Summary

              WikidataR                       WikidataQueryServiceR                      SPARQL package
CRAN          WikidataR                       WikidataQueryServiceR                      SPARQL
Repository    github.com/Ironholds/WikidataR  github.com/bearloga/WikidataQueryServiceR  github.com/cran/SPARQL
Version       1.4.0                           1.0.0                                      1.16
Release date  2017-09-22                      2020-06-17                                 2013-10-25
Target        MediaWiki API                   Wikidata Query Service                     any SPARQL endpoint

Features

WikidataR:
  • get_item(id)
  • get_property(id)
  • get_random_item([limit])
  • get_random_property([limit])
  • find_item(term[,language=][,limit=])
  • find_property(term[,language=][,limit=])
WikidataQueryServiceR:
  • query_wikidata(query)*
SPARQL package:
  • SPARQL(endpoint,query)*

Pros

WikidataR:
  • simple and effective
  • full support of Wikibase data model (but not up to date)
WikidataQueryServiceR:
  • simple and effective
  • has the power of SPARQL queries
SPARQL package:
  • has the power of SPARQL queries
  • one library to query any SPARQL endpoint

Cons

WikidataR:
  • no support of lexemes
WikidataQueryServiceR:
  • (none listed)
SPARQL package:
  • slow

(*) Both WikidataQueryServiceR and the SPARQL package have options to change the output format; they are not covered here.

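So, which library should you use? It depends on your needs:

  • to fetch or search individual items and properties through the API, use WikidataR;
  • to run SPARQL queries against Wikidata, use WikidataQueryServiceR;
  • to query several SPARQL endpoints with a single library, use the SPARQL package.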

Update February 2022: this post has been updated to reflect new releases of WikidataR and WikidataQueryServiceR.


R logo by The R Foundation, CC BY-SA 4.0.