30th October 2019
In this post I’m introducing a new data quality portal that we are currently testing with K-samsök’s data partners. It’s a data quality tool without any percentages or metrics.
To solve data quality issues at cultural heritage institutions (or anywhere) two key things need to be achieved:
- Awareness - individuals need to be aware of the specific data quality issues present in their data.
- Actionability - individuals need to have the tooling and knowledge to find and fix individual quality issues.
The first one is something that aggregators have been targeting for a while through spreadsheets and percentages. The screenshot below shows the second(ish) iteration of our licensing statistics/issues spreadsheet that we share with data partners (the code behind it was written by my colleague Marcus and is open source).
Some institutions have been able to act upon the insights given by this, while others have not. A common issue is that institutions cannot query their own data; often this is due to limited capabilities in their collection management systems, sometimes combined with gaps in user knowledge. Either way, it’s a barrier.
Being an aggregator allows us to lower this barrier for 70+ data partners at once by curating advanced queries and building a GUI around them. So instead of being unable to query their data, or getting stuck writing some boolean and/or whatever query in their CMS, partners can now list problematic objects within a few clicks.
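To make this concrete, here is a minimal sketch of what one of these curated queries could look like when run directly against the K-samsök search API. The index names, the boolean query syntax and the API key parameter are illustrative assumptions for this sketch, not the portal’s actual implementation.

```python
import requests

# K-samsök (SOCH) search API endpoint.
API = "http://kulturarvsdata.se/ksamsok/api"

def objects_missing_license(institution, hits=50):
    """List objects from one institution that lack a license.

    NOTE: the index names ("serviceOrganization", "itemLicense")
    and the boolean query syntax are assumptions made for this
    sketch - check the K-samsök API documentation for the real ones.
    """
    params = {
        "method": "search",
        "query": f'serviceOrganization="{institution}" AND NOT itemLicense=*',
        "hitsPerPage": hits,
        "x-api": "demo",  # assumed API key parameter for this sketch
    }
    response = requests.get(API, params=params)
    response.raise_for_status()
    return response.text  # XML records, including URIs back to the source

# For example, listing potentially problematic objects for one partner:
print(objects_missing_license("raa"))
```

The point is not the exact syntax but the barrier: a query like this is trivial for us to curate once and wrap in a GUI, while being out of reach for many partners’ collection management systems.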
These are the data quality queries currently available to data partners in the proof of concept version of the data portal (we and our partners have plenty of other ones in mind).
After selecting a quality query and an institution, the user is presented with a list of problematic (or possibly problematic) objects, each with a link back to the source page or CMS as well as other metadata that might be relevant for the given query.
In a couple of weeks it will be clear whether this new approach has an impact. I know what I’m betting on.
4th July 2019
For the last few years I have had an online privacy approach in the style of “Do not put all eggs in the same basket”, or exemplified: “If I use Google for email I won’t use it for browsing the web”.
Now, after a few years of empirical learning, I have decided to change this approach. It’s clear that the owner of “my” online data (the irony) is seldom static, nor does it keep the data within its own walls.
My new approach is to create as little online data as possible. Below are some actual examples of things that have led me towards this decision.
- Recruiting firms aggregating online data about me. Although I’m not a software developer by profession, I get a lot of spam from recruiting firms. I have found that these firms often have aggregated public information about me from sources such as Github and Twitter.
- Recently I discovered that two sites my employer (a government agency) hosts had been relying on a third-party service that fingerprinted our users for years and sold the data.
- Opera going from being an Oslo-based company to being bought by a random Chinese company. As a user I was never informed that “my” data changed owner.
There are probably plenty of cases where these types of issues have been combined and have exposed information about me to third parties unknown to me.
What I’m doing to limit online data about me
- Switching from Opera to Firefox while applying extensions such as uBlock Origin. Although Firefox does not have all the nice features that Opera has (tab preview, popup video and a built-in RSS reader), it should be a quite effortless switch.
- I will be creating an online presence inventory: a list of sites that I know hold some non-anonymous data about me, such as a user account, because the first step towards action is awareness. This first step is not much of an effort, but once I start deleting accounts and other information it will likely take up plenty of time.
- I will host my own DNS server and apply a whitelist, blocking every single domain that I have not looked into myself. This will break the web for me and take quite some effort to get right, but it will be worth it (see the sketch after this list).
- I will intercept common CDN services with one hosted on my own network. This should be a rather easy task and I’m sure there are some good solutions out there already in use (the sketch below includes an example of this as well).
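As a rough illustration of both the whitelist DNS and the CDN interception, here is a minimal dnsmasq configuration sketch, assuming dnsmasq 2.76 or later. Every domain, upstream resolver and IP address below is a placeholder, and the local server at 192.168.1.10 is assumed to host copies of the intercepted CDN assets.

```
# /etc/dnsmasq.conf - whitelist-only DNS, a minimal sketch

# Block everything by default: "#" matches any domain, and an
# address line without an IP makes dnsmasq answer NXDOMAIN.
address=/#/

# Allow only domains I have vetted, forwarding them to an
# upstream resolver (9.9.9.9 as a placeholder).
server=/mozilla.org/9.9.9.9
server=/wikipedia.org/9.9.9.9

# CDN interception: answer queries for a CDN hostname with the
# address of a server on my own network that serves local copies
# of the assets (192.168.1.10 is a placeholder).
address=/cdnjs.cloudflare.com/192.168.1.10
```

One caveat: CDN assets are served over HTTPS, so the local server also needs to present a certificate the browser trusts, for example one signed by a locally installed CA, otherwise every intercepted request will fail.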
One might see me as paranoid or a privacy geek, but these actions come from actual concerns and real-world examples.
9th February 2019
I created one of those “awesome lists” for K-samsök resources. I have personally found awesome lists useful when starting with something new, or when I just need to find useful components for a hobby project. Therefore I decided that it might help someone else to have one list of all the best K-samsök projects and resources, so that people can get started quicker with one of Europe’s largest Linked Open Data platforms for heritage data.
The list is on Github and licensed under CC0.