Blog

Partial updates of large Snowman sites

30th November 2023

Snowman is a static site generator for SPARQL backends. Since its inception, a goal has been that one should be able to use it to build large sites with 100,000 pages. One way Snowman makes this possible is by relying heavily on the caching of all SPARQL queries.

Building the Govdirectory website from a blank cache would issue thousands of SPARQL queries to the Wikidata Query Service. This, however, rarely happens since Snowman’s built-in cache “manager” allows one to selectively invalidate parts of the cache. Let’s see how one would use this feature to update parts of the Govdirectory website.

Basic real-world examples

Remove all top-level country data

snowman cache countries.rq --invalidate

The above invalidates the cache for the countries.rq query.

Remove all account data for Iceland’s Ministry for Foreign Affairs

snowman cache account-data.rq Q15983772 --invalidate

The above invalidates the cache for the instance of the account-data.rq query which was called with Q15983772 as its only argument.

An advanced real-world example

Now, what if you want to update all account data for all Icelandic government agencies? Because the same account-data.rq query is used for every country, you can’t rely only on Snowman’s cache invalidation. Instead, we need to involve some scripting.

Update all account data for all Icelandic government agencies

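#!/bin/sh

# Each agency directory under site/iceland/ is named after its Wikidata QID.
for i in $(find site/iceland/* -type d);
  do
  # Strip any trailing slash, then take the third path component (the QID).
  qid=$(echo "${i%%/}" | cut -f3 -d"/");
  echo "$qid"
  snowman cache account-data.rq "$qid" --invalidate
done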

The above script takes advantage of the fact that Govdirectory uses the identifiers from Wikidata to both build its output URIs and parameterize its SPARQL queries. The script iterates over all directories in the site/iceland/ directory (site being the directory to which Snowman writes its output) and extracts the Wikidata identifier from the directory names. It then invalidates the cache for the account-data.rq query for each of the directories.
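Before invalidating anything, it can be reassuring to see which commands such a loop would issue. The following is a minimal dry-run sketch, not part of Govdirectory: list_qids is a hypothetical helper name, and it assumes that every directory directly below the country directory is named after a Wikidata identifier.

```shell
#!/bin/sh

# Print the names of the directories directly below the given directory.
# Assumption: each of those names is a Wikidata QID.
list_qids() {
  for dir in "$1"/*/; do
    [ -d "$dir" ] || continue   # skip when the glob matched nothing
    basename "$dir"
  done
}

# Dry run: print the cache commands instead of executing them.
list_qids "${1:-site/iceland}" | while read -r qid; do
  echo "snowman cache account-data.rq $qid --invalidate"
done
```

Once the printed commands look right, the echo can be replaced with the actual snowman call.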

Conclusion

Behind the scenes Snowman’s cache manager will first hash the query file name and subsequently the issued query. Thus, a hierarchy of directories is created where the first level is the hash of the query file name and the second level is the hash of the issued query. This is what enables Snowman’s support for selectively invalidating the cache.
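As a rough illustration of that two-level layout (this is not Snowman's actual code; the use of SHA-256, the cache/ root directory, and the function names are all assumptions made for the sketch):

```shell
#!/bin/sh

# Print a hex digest of the first argument.
# Assumption: SHA-256 via sha256sum; Snowman may use a different hash.
hash_of() {
  printf '%s' "$1" | sha256sum | cut -d' ' -f1
}

# Build a path of the form cache/<hash of file name>/<hash of issued query>.
# The "cache/" root is a placeholder, not necessarily Snowman's real location.
cache_path() {
  printf 'cache/%s/%s\n' "$(hash_of "$1")" "$(hash_of "$2")"
}

# Two queries issued from the same file share the first directory level.
cache_path "account-data.rq" "SELECT * WHERE { wd:Q15983772 ?p ?o }"
cache_path "account-data.rq" "SELECT * WHERE { wd:Q42 ?p ?o }"
```

With a layout like this, invalidating a whole query file means removing one first-level directory, while invalidating a single issued query means removing one entry inside it.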

In the cache Snowman stores the raw SPARQL resultsets in JSON and the cache command allows one to inspect the cache. For example, to see the cache for the account-data.rq query for the Icelandic Ministry for Foreign Affairs one would run:

snowman cache account-data.rq Q15983772

When planning to build a large site with Snowman I would recommend that you first put time into thinking how easy your information/data model is to query. That can be tricky with a project utilising open models such as the one of Wikidata and Wikibase, but Snowman

Do you have suggestions for how Snowman could improve its support for large sites? Check out the dedicated large-project-support issue tag on Github!

MediaWiki development with SQLite and PHP

21st November 2023

Recently I have been ranting a little bit about the many different solutions for setting up MediaWiki development environments. Visit mediawiki.org and you will likely find solutions based on Docker, Vagrant, and custom CLI tools. Some are maintained, some are usable only on particular Linux distros, etc.

However, all you need for the vast majority of MediaWiki development is PHP and SQLite.

MediaWiki has limited SQLite support according to mediawiki.org, but I have found that it works for the vast majority of cases, and incompatibilities are tracked on Phabricator.

On Fedora, I get all the requirements for running MediaWiki with dnf install php php-pdo. Then I run php -S localhost:8080 in the root of the MediaWiki repository and I’m good to go.

A downside is that I need to set up OpenSearch or Elasticsearch once in a while for tasks requiring CirrusSearch, but that is a price I’m happy to pay for a stable and lightweight development environment.

Building Snowman sites on Github Actions

15th November 2023

Snowman is a static site generator for SPARQL backends. HTML templates and SPARQL queries in, a website out.

I have a set of Snowman sites that need to be built and deployed once a day to ensure that they are up to date. I wanted to do this a while back for one of them using Github Actions.

The following Github action will:

Check out the repository
Download the Snowman binary and make it executable
Run the Snowman build command

name: build-and-deploy
on: [push]
jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    env:
      SNOWMAN_BINARY: https://github.com/glaciers-in-archives/snowman/releases/download/0.5.0/snowman-linux-amd64
    steps:
      - uses: actions/checkout@v3
      - name: Download Snowman
        run: wget "${{ env.SNOWMAN_BINARY }}" -O snowman && chmod +x snowman
      - name: Build site
        run: ./snowman build
        # additional steps for deploying the contents of "site" directory

That’s it. From here you would add further steps for deploying the contents of the site directory to the host of your choice.

Well, if you, like me, like having small sites that hold their data in one or a few RDF files, you won’t have a SPARQL endpoint to query. The solution? Run the Oxigraph database in the same Github action!

In addition to the above the following Github action will:

Download the Oxigraph binary and make it executable
Load the RDF data into the Oxigraph database
Run Oxigraph and wait for it to start before running Snowman

name: build-and-deploy
on: [push]
jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    env:
      OXIGRAPH_BINARY: https://github.com/oxigraph/oxigraph/releases/download/v0.3.16/oxigraph_server_v0.3.16_x86_64_linux_gnu
      SNOWMAN_BINARY: https://github.com/glaciers-in-archives/snowman/releases/download/0.5.0/snowman-linux-amd64
    steps:
      - uses: actions/checkout@v3
      - name: Download Oxigraph
        run: wget "${{ env.OXIGRAPH_BINARY }}" -O oxigraph && chmod +x oxigraph
      - name: Download Snowman
        run: wget "${{ env.SNOWMAN_BINARY }}" -O snowman && chmod +x snowman
      - name: Load RDF
        run: ./oxigraph load --file static/data.ttl --location datastore
      - name: Run SPARQL server and build site
        run: ./oxigraph serve --location datastore & sleep 4 && ./snowman build

The sleep 4 is there to give Oxigraph some time to start before running Snowman. It’s not a very elegant solution and it would be awesome if someone (i.e. me) could make a service container for Oxigraph.
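Short of a service container, one possible alternative to a fixed sleep is to poll until the server answers. The sketch below is only that, a sketch: wait_for is a hypothetical helper, and the commented usage assumes that curl is available on the runner and that Oxigraph listens on localhost:7878 (its default, as far as I know).

```shell
#!/bin/sh

# Retry a command once per second until it succeeds or the attempts run out.
# Usage: wait_for <max_attempts> <command> [args...]
wait_for() {
  attempts=$1
  shift
  while [ "$attempts" -gt 0 ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    attempts=$((attempts - 1))
    sleep 1
  done
  return 1
}

# Possible use in the workflow step (address and port are assumptions):
#   ./oxigraph serve --location datastore &
#   wait_for 30 curl --silent http://localhost:7878/ && ./snowman build
```

The build then starts as soon as the server responds, and the step fails instead of building against a missing endpoint if the server never comes up.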

Still looking for a full real-world example? Check out the Github action used to deploy FornPunkt’s Open Data site from a single DCAT RDF file.

Making Everything an RSS Feed

17th July 2023

A while back I set a goal along the lines of “make all the data on fornpunkt.se available as an RSS feed”. One might ask why. Well, I think that one shouldn’t be required to use the FornPunkt website to access and reuse its content. I also think that RSS is a great format for this given that most content has a temporal component to it and that RSS has many great clients, integrations, and extensions.

All posts? A GeoRSS feed.
All posts with a given tag? A GeoRSS feed.
All posts by a given user? A GeoRSS feed, optionally with an access token.
All tags? An RSS feed.
Comments? An RSS feed.
All comments on a given post? An RSS feed.
All comments by a given user? An RSS feed, optionally with an access token.
Comments classified as damage reports? An RSS feed.
Annotations? An RSS feed.

And so on. Some of these are more useful than others. The GeoRSS ones appear to be rather popular while the “Comments on a given post” feed is not used much. It wasn’t much overhead to add these feeds as I already had two base classes for RSS and GeoRSS feeds in the core Django application.

In the end not only do users end up with the option to use one of many RSS clients, but there is also an extra set of APIs that might be more accessible than many of the other APIs given that RSS is a well-known format and that it is easy to discover. Will I keep to this goal? Will I expand it to other sites? I don’t know, but given the low overhead in this case I do not yet regret it.