How to query Wikidata in R

This post is an overview of R libraries that query Wikidata and allow you to fetch data from it.

Libraries

WikidataR 1.4.0

WikidataR is the only R library that targets the Wikidata part of the MediaWiki API. It is basic (not all features of the API are covered), but it works well and is neatly documented. You can get a specific item or property with get_item and get_property, get random items or properties with get_random_item and get_random_property (an optional parameter lets you fetch several elements at once), and find items and properties by their labels and aliases with find_item and find_property. WikidataR fully supports the Wikibase data model, with one caveat: it is out of date, as lexicographical data (lexemes) is not yet implemented.

WikidataQueryServiceR 0.1.1

WikidataQueryServiceR is an R library that targets the Wikidata Query Service (WDQS), the official SPARQL endpoint of Wikidata. It provides a simple function, query_wikidata, that returns the results of a query.

This library also provides a function named scrape_example to get queries from the examples page. Unfortunately, it does not work properly and can return unwanted results. In the development version of WikidataQueryServiceR, scrape_example is replaced by get_example, but get_example currently does not work at all.

SPARQL package 1.16

The SPARQL package is a generic R library that allows you to query any SPARQL endpoint. Its advantage over WikidataQueryServiceR is that you can use it to query several SPARQL endpoints, not only Wikidata's. Surprisingly, this library is much slower than WikidataQueryServiceR, even when your query returns only a few thousand results.

Miscellaneous

This post only covers general-purpose libraries (I may have missed some!). More specific libraries that allow you to retrieve data from Wikidata exist, like wikitaxa for taxonomic data, or webchem for chemical data.

Examples

WikidataR

As the documentation of WikidataR is short and covers nearly everything, I’ll let you read it!
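
For a quick taste before opening the documentation, here is a minimal sketch of the functions listed in the Libraries section. It assumes WikidataR is installed and that you have network access. The helper name wikidatar_demo is made up for this post, and the identifiers Q2374 (Civilization III) and P577 (publication date) are simply the ones that appear in the examples below.

```r
# Sketch only: needs the WikidataR package and a network connection.
# wikidatar_demo is a hypothetical helper name, not part of WikidataR.
wikidatar_demo <- function() {
  if (!requireNamespace("WikidataR", quietly = TRUE)) {
    stop("Install WikidataR first: install.packages('WikidataR')")
  }
  list(
    # A specific item and a specific property, fetched by identifier
    item     = WikidataR::get_item(id = "Q2374"),      # Civilization III
    property = WikidataR::get_property(id = "P577"),   # publication date
    # Two random items fetched in a single call
    random   = WikidataR::get_random_item(limit = 2),
    # Items whose label or alias matches a search term
    matches  = WikidataR::find_item("Civilization", language = "en", limit = 5)
  )
}
```

Calling res <- wikidatar_demo() should give you a list of four results; str(res$item, max.level = 2) is a convenient way to explore what came back.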

WikidataQueryServiceR

Install the library and load it:

> install.packages("WikidataQueryServiceR")
> library(WikidataQueryServiceR)

Query all video games that have a publication date, keeping only the earliest date per game (a video game can have several publication dates, depending on the platform or the region where it was published):

> r <- query_wikidata('
    SELECT ?item ?itemLabel (MIN(?date) AS ?date) (MIN(?year) AS ?year) {
        ?item wdt:P31 wd:Q7889 ; wdt:P577 ?date .
        BIND(YEAR(?date) AS ?year) .
        SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
    }
    GROUP BY ?item ?itemLabel
    HAVING (?year > 1)
')
25817 rows were returned by WDQS

Display the first ones:

> head(r)
                                  item                      itemLabel                 date year
1 http://www.wikidata.org/entity/Q2374               Civilization III 2001-10-30T00:00:00Z 2001
2 http://www.wikidata.org/entity/Q2377                Civilization IV 2005-10-25T00:00:00Z 2005
3 http://www.wikidata.org/entity/Q2385                 Civilization V 2010-09-21T00:00:00Z 2010
4 http://www.wikidata.org/entity/Q2387    Commandos 2: Men of Courage 2001-09-20T00:00:00Z 2001
5 http://www.wikidata.org/entity/Q2440 Freedom Force vs the 3rd Reich 2005-03-08T00:00:00Z 2005
6 http://www.wikidata.org/entity/Q2450    Heroes of Might and Magic V 2006-05-16T00:00:00Z 2006

Display the number of games published each year:

> barplot(table(r$year), col = "dodgerblue3", xlab = "year", ylab = "count")

SPARQL package

Install the library and load it:

> install.packages("SPARQL")
> library(SPARQL)

We use the same query as in the previous example and, as this library can query any SPARQL endpoint, we have to give it the URL of the Wikidata endpoint:

> r <- SPARQL('https://query.wikidata.org/sparql', '
    SELECT ?item ?itemLabel (MIN(?date) AS ?date) (MIN(?year) AS ?year) {
        ?item wdt:P31 wd:Q7889 ; wdt:P577 ?date .
        BIND(YEAR(?date) AS ?year) .
        SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
    }
    GROUP BY ?item ?itemLabel
    HAVING (?year > 1)
')

Display the first results (note that they are in r$results):

> head(r$results)
                                    item                           itemLabel       date year
1 <http://www.wikidata.org/entity/Q2374>               "Civilization III"@en 1004396400 2001
2 <http://www.wikidata.org/entity/Q2377>                "Civilization IV"@en 1130191200 2005
3 <http://www.wikidata.org/entity/Q2385>                 "Civilization V"@en 1285020000 2010
4 <http://www.wikidata.org/entity/Q2387>    "Commandos 2: Men of Courage"@en 1000936800 2001
5 <http://www.wikidata.org/entity/Q2440> "Freedom Force vs the 3rd Reich"@en 1110236400 2005
6 <http://www.wikidata.org/entity/Q2450>    "Heroes of Might and Magic V"@en 1147730400 2006

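You may notice that, by default:

  • item URIs are wrapped in angle brackets (< and >);
  • labels are wrapped in double quotes and followed by their language tag (@en);
  • dates are returned as numeric timestamps instead of ISO 8601 strings.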
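
If you want values shaped like the ones WikidataQueryServiceR returns, the columns shown in the output above can be cleaned up with base R. This is only a sketch: the sample rows are copied from that output so it runs offline, the variable names results and clean are made up, and reading the date column as seconds since 1970-01-01 is an assumption based on the values displayed.

```r
# Two sample rows copied from head(r$results), so no network is needed
results <- data.frame(
  item      = c("<http://www.wikidata.org/entity/Q2374>",
                "<http://www.wikidata.org/entity/Q2377>"),
  itemLabel = c('"Civilization III"@en', '"Civilization IV"@en'),
  date      = c(1004396400, 1130191200),
  year      = c(2001, 2005),
  stringsAsFactors = FALSE
)

clean <- results
# Drop the angle brackets around the URIs
clean$item <- gsub("^<|>$", "", clean$item)
# Drop the surrounding quotes and the trailing language tag
clean$itemLabel <- sub('^"(.*)"@[A-Za-z-]+$', "\\1", clean$itemLabel)
# Assumption: the numbers are Unix timestamps (seconds since 1970-01-01)
clean$date <- as.POSIXct(clean$date, origin = "1970-01-01", tz = "UTC")

print(clean)
```

With a real result, replace the sample data frame by r$results. As mentioned in the summary, the package also has options that change this formatting, which may make the cleanup unnecessary.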

Still, you can display the same graph:

> barplot(table(r$results$year), col = "dodgerblue3", xlab = "year", ylab = "count")

Summary

              WikidataR                       WikidataQueryServiceR                      SPARQL package
CRAN          WikidataR                       WikidataQueryServiceR                      SPARQL
Repository    github.com/Ironholds/WikidataR  github.com/bearloga/WikidataQueryServiceR  github.com/cran/SPARQL
Version       1.4.0                           0.1.1                                      1.16
Release date  2017-09-22                      2017-04-28                                 2013-10-25
Target        MediaWiki API                   Wikidata Query Service                     any SPARQL endpoint

Features
  WikidataR:
    • get_item(id)
    • get_property(id)
    • get_random_item([limit])
    • get_random_property([limit])
    • find_item(term[,language=][,limit=])
    • find_property(term[,language=][,limit=])
  WikidataQueryServiceR:
    • query_wikidata(query)*
  SPARQL package:
    • SPARQL(endpoint,query)*

Pros
  WikidataR:
    • simple and effective
    • full support of Wikibase data model (but not up to date)
  WikidataQueryServiceR:
    • simple and effective
    • has the power of SPARQL queries
  SPARQL package:
    • has the power of SPARQL queries
    • one library to query any SPARQL endpoint

Cons
  WikidataR:
    • no support of lexemes
  WikidataQueryServiceR:
    • scraping of examples does not work
  SPARQL package:
    • slow

(*) Both WikidataQueryServiceR and the SPARQL package have options to change how results are formatted; they are not covered here.

So, what library should you use? It depends on your needs: