Albin Larsson: Blog

Culture, Climate, and Code

Getting Random Results in SPARQL

17th September 2020

Just the other day I decided to take a stab at an old StackOverflow question about getting random results from SPARQL.

The most obvious solution might first appear to be using SPARQL’s built-in RAND() function and order by that random number:

SELECT ?s WHERE {
  ?s ?p ?o .
  BIND(RAND() AS ?random) .
} ORDER BY ?random
LIMIT 1

This would be perfectly fine if it weren’t for SPARQL engines trying to be smart and statically evaluating the BIND expression. From the second result row onwards, most SPARQL engines see the expression and “think”: “oh, this is identical to what I evaluated for the previous row, so the result must be the same”.

This can be illustrated by the Wikidata SPARQL query below; note how all rows have the same ?random value:

SELECT ?item ?itemLabel ?random WHERE {
  ?item wdt:P31 wd:Q11762356 .
  BIND(RAND() AS ?random) .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE], en" . }
} ORDER BY ?random
LIMIT 100
# comment, change this before each run to bypass WDQS cache

A common solution to this issue is to ditch RAND() entirely and instead hash a value in each row and sort by the hash.

SELECT ?s WHERE {
  ?s ?p ?o .
  BIND(MD5(?s) AS ?random) .
} ORDER BY ?random
LIMIT 1

This will, however, generate the same order each time, as the hashes are based entirely on the result data. Illustrated by the example below:

SELECT ?item ?itemLabel ?random WHERE {
  ?item wdt:P31 wd:Q11762356 .
  BIND(MD5(STR(?item)) AS ?random) .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE], en" . }
} ORDER BY ?random
LIMIT 100
# comment, change this before each run to bypass WDQS cache

The solution is to combine the two. Including the per-row result data in the hash ensures the SPARQL engine can’t statically evaluate the expression, while the RAND() salt, now actually evaluated, makes the ordering different on every run.

SELECT ?s WHERE {
  ?s ?p ?o .
  BIND(SHA512(CONCAT(STR(RAND()), STR(?s))) AS ?random) .
} ORDER BY ?random
LIMIT 1

Here again illustrated with a Wikidata query:

SELECT ?item ?itemLabel ?random WHERE {
  ?item wdt:P31 wd:Q11762356 .
  BIND(SHA512(CONCAT(STR(RAND()), STR(?item))) AS ?random) .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE], en" . }
} ORDER BY ?random
LIMIT 100
# comment, change this before each run to bypass WDQS cache
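Outside of SPARQL, the difference between the two approaches boils down to the following plain-Python sketch (hashlib and random are stdlib; the IRI is just an example value):

```python
import hashlib
import random

iri = 'http://www.wikidata.org/entity/Q42'

# Hashing the result data alone: the digest, and hence the sort order,
# is identical on every run.
stable = hashlib.md5(iri.encode()).hexdigest()
print(stable == hashlib.md5(iri.encode()).hexdigest())  # True

# Salting the hash with a fresh random number first: a new digest,
# and a new order, on every run.
def salted(value):
  return hashlib.sha512((str(random.random()) + value).encode()).hexdigest()

print(salted(iri) == salted(iri))  # False (with overwhelming probability)
```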

Know a better way to retrieve random results from SPARQL? Let me know on Twitter!

My Essential Backup Script

16th September 2020

On my main computer, there are essentially three things that change regularly and aren’t backed up to a cloud service: notes, browser bookmarks, and keys/passwords. To be able to back these up to external servers without hassle, I wrote a Bash script a while back that allows me to type backup anywhere in my terminals to quickly get an encrypted zip containing these files.

The following code can be placed in a file named backup inside /usr/bin, and once you have made it executable and set your everyday user as its owner, you are good to go. Typing backup should prompt you for an encryption key, and after that you should have two backup files in your current directory. You should of course change the file paths as needed; the example covers my use of Joplin, Firefox, and Seahorse.


#!/bin/bash

echo "Starting backup"

echo "Locating Firefox profile."

FIREFOX_PROFILE=$(find ~/.mozilla/firefox/ -maxdepth 1 -type d -name "*.default" | head -1)

echo "Firefox profile found at: ${FIREFOX_PROFILE}"

SSH="${HOME}/.ssh"
KEYS="${HOME}/.local/share/keyrings"
BOOKMARKS="${FIREFOX_PROFILE}/places.sqlite"
NOTES="${HOME}/.config/joplin-desktop"

echo -e "Directories selected for backup:\n${BOOKMARKS}\n${KEYS}\n${SSH}\n${NOTES}"

echo "Zipping directories"
zip -r backupoutput.zip "${BOOKMARKS}" "${KEYS}" "${SSH}" "${NOTES}"

echo "Encrypting zip file"
gpg -c backupoutput.zip

echo "Done, remember to delete both files"

Writing Structured Data on Commons with Python

15th September 2020

Pywikibot does not yet have built-in support for writing Structured Data to Wikimedia Commons, so for now one needs to do it by posting JSON data to the Wikimedia Commons Wikibase API. This blog post will walk you through how to make the requests needed and how to structure the JSON to get it all working.

The minimal example presented here will check if the given file has a statement claiming that it depicts a hat and if not write such a statement.

First off, you will need to have Pywikibot installed and all good to go; the following imports and code should run without error.

import json

import pywikibot

site = pywikibot.Site('commons', 'commons')
site.login()
site.get_tokens('csrf') # preload csrf token

Next up, let’s turn a page name/filename into an MID; think of an MID as Wikidata’s QID but for Wikimedia Commons. The MID happens to correspond to MediaWiki’s “pageid”.

page = pywikibot.Page(site, title='Konst och Nyhetsmagasin för medborgare af alla klasser 1818, illustration nr 44.jpg', ns=6)

media_identifier = 'M{}'.format(page.pageid)

Next up, we need to fetch all existing structured data so that we can check what statements already exist. This is the first place where we need to use Pywikibot’s internal API wrapper “_simple_request” to call the Wikibase API; you could do the same with a regular HTTP library such as Requests.

request = site._simple_request(action='wbgetentities', ids=media_identifier)
raw = request.submit()
existing_data = None
if raw.get('entities').get(media_identifier).get('pageid'):
  existing_data = raw.get('entities').get(media_identifier)
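For those who prefer to avoid Pywikibot’s internal wrapper, this read-only call can also be made with the standard library alone. A sketch, where the helper name and the MID are my own placeholders:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def wbgetentities_url(media_identifier):
  """Build the read-only wbgetentities URL for a given MID."""
  params = {'action': 'wbgetentities', 'ids': media_identifier, 'format': 'json'}
  return 'https://commons.wikimedia.org/w/api.php?' + urlencode(params)

# raw = json.load(urlopen(wbgetentities_url('M12345')))  # performs the network call
```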

Next, let us check if depicts (P180) already has a statement with the value Q80151 (hat); if so, exit the program.

# existing_data can be None (no structured data at all) and P180 can be
# absent, so fall back to an empty list in both cases.
depicts = []
if existing_data:
  depicts = existing_data.get('statements', {}).get('P180', [])
# Q80151 (hat)
if any(statement['mainsnak']['datavalue']['value']['id'] == 'Q80151' for statement in depicts):
  print('There already exists a statement claiming that this media depicts a hat.')
  exit()

Now we need to create the JSON defining such a claim; it’s verbose, to say the least. You can add more claims by appending more objects to the “claims” array. To get an idea of what these JSON structures can look like, you can add structured data using the Wikimedia Commons GUI and then look at the resulting JSON by appending “.json” to the media’s URI. It might be particularly interesting to try out qualifiers and references.

statement_json = {'claims': [{
  'mainsnak': {
    'snaktype':'value',
    'property': 'P180',
    'datavalue': {
      'type' : 'wikibase-entityid',
      'value': {
        'numeric-id': '80151',
        'id' : 'Q80151',
      },
    },
  },
  'type': 'statement',
  'rank': 'normal',
}]}

Now, all we need to do is to send this data to the Wikibase API together with some additional information such as a CSRF token, the media identifier, etc.

csrf_token = site.tokens['csrf']
payload = {
  'action' : 'wbeditentity',
  'format' : u'json',
  'id' : media_identifier,
  'data' : json.dumps(statement_json, separators=(',', ':')),
  'token' : csrf_token,
  'summary' : 'adding depicts statement',
  'bot' : True, # in case you're using a bot account (which you should)
}

request = site._simple_request(**payload)
try:
  request.submit()
except pywikibot.data.api.APIError as e:
  print('Got an error from the API, the following request was made:')
  print(request)
  print('Error: {}'.format(e))

That should be it. You can now use this example to create your own wrapper around this functionality to make it usable in batch operations.
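As a starting point for such a wrapper, the JSON building and the duplicate check from above can be pulled out into plain functions; a sketch in which the function names are my own invention:

```python
def build_depicts_claim(qid):
  """Minimal JSON for a depicts (P180) statement pointing at the given item QID."""
  return {'claims': [{
    'mainsnak': {
      'snaktype': 'value',
      'property': 'P180',
      'datavalue': {
        'type': 'wikibase-entityid',
        'value': {'numeric-id': qid.lstrip('Q'), 'id': qid},
      },
    },
    'type': 'statement',
    'rank': 'normal',
  }]}

def has_depicts(entity, qid):
  """True if the entity JSON already carries a P180 statement with value qid."""
  statements = (entity or {}).get('statements') or {}
  depicts = statements.get('P180', []) if isinstance(statements, dict) else []
  return any(s['mainsnak']['datavalue']['value']['id'] == qid for s in depicts)
```

These can then be reused per file in a batch loop, with the site setup and the wbeditentity call from above unchanged.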

In case you want to write SDC with the mwoauth/mwapi libraries instead of Pywikibot you can look at this Flask application built for the Roundtripping project to get a hint.

Recently Week 34 2020

24th August 2020

Things

Took part in the weekly competition on the Swedish Wikipedia about adding video material to articles. In total, I added videos to 44 articles, thanks to the Wikidata query service.

Did a Wikidata live-stream with Jan again; I highlighted property P5991 “carbon footprint” hoping that usage will go up! You can watch the recording on YouTube.

Following a two-day hike this weekend I finally (buried the hatchet?) got an iNaturalist account.

Visualized disused railways in Sweden and Norway over lunch one day.

Screenshot of my disused railways visualization.

Reading

Wikimedia Foundation Annual Carbon Footprint Report

Python Hash Tables: Understanding Dictionaries - a dive into how Python dictionaries are built on top of hash tables.

Jupyter Notebook REST API - While researching whether there is an (easy) way to turn Papermill notebooks into APIs, I found Jupyter Kernel Gateway, which took me halfway to what I wanted.

Apple vs. Epic - This is such an interesting case: a well-prepared lawsuit from a rich activist who wants to change the system, while Apple already faces investigations in parallel.

New in PHP 8 - Each time these “new in PHP” articles come around I get an urge to get back into some PHP development (other than MediaWiki); same this time.

Cool things around the internet

The Open Restitution Project - This project looks to collect and aggregate restitution data from an African perspective. They are currently looking for partners.

Final notes

“Recently” is a new format I’m trying out, inspired by my reading lists that have been up and down for years. The intention is to be broader and include things I and others have done that won’t make it into a post on its own.
