Solving Project Euler problems with SPARQL and Wikidata

Founded in 2001, Project Euler is a website offering a series of more than 700 mathematical problems that you can solve with the programming language of your choice. Unsurprisingly, Python, C/C++, and Java are the most popular languages used by visitors to the website. I wanted to know if it was possible to use Wikidata and its SPARQL endpoint to solve these problems.

Spoiler: this post gives the solution to the first problem and discusses the second one. Don’t hesitate to go to Project Euler first and find the solutions by yourself.

Problem 1

The objective of the first problem is to find the sum of all the multiples of 3 or 5 below 1,000.

Let’s find all natural numbers in Wikidata, with their numerical values:

SELECT ?item ?value {
  ?item wdt:P31 wd:Q21199 ; wdt:P1181 ?value .
}
ORDER BY ?value

As you can see, Wikidata has more than 10,000 items about natural numbers. Some values are duplicated, like Q199 (the number 1) and Q713113 (the number 0.999… which is equal to 1).
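To see which values appear more than once, a query along the following lines (my addition, reusing the same pattern) groups the items by value and keeps only the duplicated ones:

SELECT ?value (COUNT(?item) AS ?items) {
  # illustrative query, not part of the original post
  ?item wdt:P31 wd:Q21199 ; wdt:P1181 ?value .
}
GROUP BY ?value
HAVING (COUNT(?item) > 1)
ORDER BY ?value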

Let’s remove the duplicates and filter the numbers to keep only the ones below 1,000:

SELECT DISTINCT ?value {
  ?item wdt:P31 wd:Q21199 ; wdt:P1181 ?value .
  FILTER (?value < 1000) .
}
ORDER BY ?value

The query returns exactly 999 distinct values, one for each integer from 1 to 999, so none is missing.
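If you prefer to double-check that count rather than scroll through the results, a simple aggregate query (my addition, reusing the same pattern) counts the rows of the previous query directly:

SELECT (COUNT(DISTINCT ?value) AS ?count) {
  # should return 999 if no number below 1,000 is missing
  ?item wdt:P31 wd:Q21199 ; wdt:P1181 ?value .
  FILTER (?value < 1000) .
}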

Let’s keep only the numbers that are multiples of 3 or 5, by checking that their division by 3 or by 5 leaves no remainder (SPARQL has no built-in modulo operator, hence the floor() trick):

SELECT DISTINCT ?value {
  ?item wdt:P31 wd:Q21199 ; wdt:P1181 ?value .
  FILTER (?value < 1000 && (floor(?value / 3) = (?value / 3) || floor(?value / 5) = (?value / 5))) .
}
ORDER BY ?value

Here are the results.

Now, we just have to sum the numbers:

SELECT (SUM(DISTINCT ?value) AS ?sum) {
  ?item wdt:P31 wd:Q21199 ; wdt:P1181 ?value .
  FILTER (?value < 1000 && (floor(?value / 3) = (?value / 3) || floor(?value / 5) = (?value / 5))) .
}

And Wikidata gives us the solution of the problem!
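As a sanity check independent of Wikidata, the same sum can be computed by inclusion–exclusion on three arithmetic series:

\sum_{\substack{n < 1000 \\ 3 \mid n \text{ or } 5 \mid n}} n
= 3\cdot\frac{333\cdot 334}{2} + 5\cdot\frac{199\cdot 200}{2} - 15\cdot\frac{66\cdot 67}{2}
= 166833 + 99500 - 33165 = 233168.

Both approaches agree, which confirms that no multiple is missing from Wikidata’s result.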

Problem 2

The objective of the second problem is to find the sum of even Fibonacci numbers below 4,000,000.

Let’s find Fibonacci numbers available in Wikidata:

SELECT ?value {
  ?item wdt:P31 wd:Q47577 ; wdt:P1181 ?value .
}
ORDER BY ?value

Only the first twenty, all below 10,000, are known in Wikidata. As a general-purpose knowledge base, Wikidata is certainly not meant to store every existing number under 4,000,000. This makes the method used for the first problem inapplicable here.
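For illustration, here is what the approach from problem 1 would look like (my adaptation of the earlier query): it runs, but it only sums the even Fibonacci numbers that happen to have an item in Wikidata, so it cannot give the expected answer.

SELECT (SUM(DISTINCT ?value) AS ?sum) {
  # only covers the Fibonacci numbers actually present in Wikidata
  ?item wdt:P31 wd:Q47577 ; wdt:P1181 ?value .
  FILTER (?value < 4000000 && floor(?value / 2) = (?value / 2)) .
}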

At the moment, I don’t see any simple way to solve this second problem using Wikidata’s SPARQL endpoint, and probably not the following ones either. Maybe you can do it?

For your information, more complex problems, unrelated to Project Euler, have been solved with SPARQL outside of Wikidata’s endpoint, such as Sudoku (for instance here or there).

A short history of Denelezh

A few weeks ago, Gnom was wondering what the actual size of the gender gap is:

Do we have an estimate of the size of the gender gap on Wikipedia given the current notability criteria? In other words, if every notable living or dead person had an article, what percentage would be about women? For example, would 30% women biographies be a good ballpark for a ‘closed’ Wikipedia gender gap?

This was my answer, which gives some insight into the history of Denelezh, a tool that provides statistics about the gender gap in the content of Wikimedia projects:

Your question is the one I had in mind when I built the first version of Denelezh. My idea was to split the problem into smaller ones and to work by subsets to make the study easier. There are subsets where the percentage of women is perfectly known. For instance, here are the percentages of women in the lower house of the French Parliament since 1945. On the French-language Wikipedia, members of parliament automatically comply with the notability criteria. Other subsets have to be studied. The tool partially allows this by providing statistics about humans depicted in Wikidata along various dimensions that you can combine (for instance country of citizenship + occupation = French politicians). But merging these subsets is not easy, as they overlap: you can’t simply add the percentages from two sets when some people belong to both. An athlete can also be a politician, someone can have several countries of citizenship, and so on…

The first version of the tool provided statistics about humans in Wikidata with a gender, a year of birth, and a country of citizenship. The (unverified) assumption was that Wikidata items with all these properties were of better quality than the ones missing one or more of them. The problem is that the statistics covered only about 50% of the humans depicted in Wikidata, and were thus misleading for people studying the gender gap in Wikimedia projects.

The second version, whose development was rushed, solved this problem by no longer filtering Wikidata items on the number of properties they have, and by providing statistics as long as the data was available. With this change (and the addition of Wikimedia projects as a dimension to filter and combine), it became closer to WHGI. One current problem is that the tool lacks statistics on Wikidata quality (for instance, how many Wikidata items depicting humans have the gender property?).

The third version will be a merge of Denelezh and WHGI. Some ideas are already in the pipeline (adding external IDs as a dimension, producing lists of notable people to help Wikimedia editors find subjects to work on, etc.), and others are on Phabricator. Feedback welcome 🙂

2020 personal plans around Wikidata and Wikibase

After a look back at 2019, I will present my two main projects around Wikidata and Wikibase for 2020: the merge of the tools providing statistics about the gender gap in the content of Wikimedia projects, and the creation of a tool, based on Wikibase, about video game reviews.

2019

Here are the main events related to Wikidata and Wikibase that I attended in 2019:

Gender gap and Denelezh

Denelezh is a tool that provides statistics about the gender gap in the content of Wikimedia projects. Thanks to Sylvain Boissel, it is now hosted by Wikimédia France.

At Wikimania, I met Maximilian Klein, the author of WHGI. We decided to merge our tools, as some of their features overlap. The idea is to create a new tool that keeps the existing features and builds a new architecture to easily implement new ones, relying on our experience with WHGI and Denelezh.

You can track the progress in T230184.

Video game reviews and Wikibase

In my first years of contributing to Wikidata, I wanted to put every piece of data I found into it, as if Wikidata were a black hole. However, this is not a good idea, for several reasons. As a general knowledge base, Wikidata’s scalability is a concern, both technically (for instance, its Query Service struggles to stay up to date) and on a human level (the more Wikidata grows, the more volunteers it needs to maintain its data, consistently across all the topics it covers). A solution is to rely on other knowledge bases, with Wikidata acting as a hub between them. In brief, as I already stated, Wikidata is not the database to rule them all, but the database to link them all.

In 2019, I started to work on video game reviews. At the moment, I’m not convinced by the data model used in Wikidata and would like to try some new ones. That’s not really possible in Wikidata, as you can’t change the data model every fortnight just for testing purposes: it would impact every consumer of this data (whether they are Wikimedia projects or projects outside the Wikimedia movement). Another problem is that, as Wikidata is a generalist database, you have to comply with the rules of every sub-project: you can’t just focus on one project (like video games) but have to cope with every related project, which can quickly become cumbersome (for instance, creating someone’s family name is not straightforward).

For all of these reasons, I decided to create my own database. To this end, Wikibase, the MediaWiki extension that powers Wikidata, seemed to be the ideal candidate. Running your own Wikibase allows you to build your own data model, with the convenience of staying technically compatible with Wikidata. The use of Wikibase by Wikidata proves that it’s a robust tool, and it is very efficient if your data set is not too big. Also, thanks to Wikidata, Wikibase has a large ecosystem, even if some tools still need to be generalized to work with any Wikibase instance (examples: the merge feature, constraint reports, etc.). The one-day Wikibase workshop at WikidataCon finished convincing me: the ease of installing, using Docker, a Wikibase instance with a complete ecosystem running (including the Query Service and QuickStatements) was amazing. My project is now ongoing.

On the community side of Wikibase, I wrote most of the 2019 report of the Wikibase Community User Group and was designated to represent the user group at the Wikimedia Summit 2020.


Photo by Anthere, CC BY-SA 4.0.