Structured Data on Commons is the most important development in Wikimedia’s usability

February 28, 2018
The Art Gallery of Jan Gildemeester Jansz by Adriaan de Lelie (1794/5) – a problematic image from a structured data perspective, containing multiple other paintings and persons with Wikidata items

By John Lubbock, Communications Coordinator

Wikimedia Commons is one of the most important sources of Creative Commons licensed media online. But it has a huge problem: it’s a total mess.

One of the biggest complaints about Commons which discourages people from using it is that it’s incredibly hard to find what you’re looking for. People upload files and fill out some information about their content, but not in a particularly systematic or structured way, with minimal automated assistance.

This means that when you try to search for something on Commons, you’ll be presented with a huge array of files with minimal relevance to your search. You can’t even really search for media by date. Which date would you be searching for anyway? The date the media was created, the date it was modified or the date it was uploaded? Without systematic organisation of media on Commons, it will continue to represent a disorderly and overgrown garden that desperately needs work to give it some order.

This is where the Structured Data on Commons (SDoC) project comes in. Funded by a $3m grant from the Alfred P Sloan Foundation, the project is running from 2017-19, and will hopefully lead to big changes in the way Commons is structured and used.

A ‘structured data bee’ by Sandra Fauconnier – image by Sandra Fauconnier CC0

Adam Baso, Engineering Director at the Wikimedia Foundation, says that ‘This current quarter involves the beginnings of search work, although the specific UX is a work in progress.’ Program lead Amanda Bittaker added that they are ‘starting to dig into specifications and designs for search this quarter (while also designing and prototyping for upload wizard and the file page.) If all goes according to plan, we should have designs to share next quarter.’

SDoC is in some ways an attempt to make Commons more like Wikidata: it introduces fields, rather than categories, which would be searchable in any language. This will structure and link data on the files hosted on Commons, which will hopefully make it more user-friendly and usable. The SDoC team have laid out a roadmap for the development of the project.

These ‘fields’ will work like statements on Wikidata, but many media files have what Amanda Bittaker calls ‘tricky ontologies’ – that is, how entities should be ‘grouped, related within a hierarchy, and subdivided according to similarities and differences’, according to the Wikipedia article on Ontology. They have compiled an interesting collection of Commons files which illustrate examples of media that raise interesting questions for the structuring of data on Commons. Files with names written in non-Latin scripts, files with multiple levels of copyright or authorship, paintings containing other paintings, abstract depictions, unknown subjects, and some maps and logos all create problems for organising the data.
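To make the idea of ‘fields’ a little more concrete, here is a minimal sketch in Python of how a media file might carry statements, and how that would resolve the question of which date you are searching by. It is not the SDoC data model: the class names, the property names (`depicts`, `year_created`, `year_uploaded`), the item identifiers and the upload years are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Statement:
    """One property/value pair attached to a file (hypothetical model)."""
    prop: str
    value: object


@dataclass
class MediaFile:
    title: str
    statements: list = field(default_factory=list)

    def values(self, prop):
        """Return every value recorded for the given property."""
        return [s.value for s in self.statements if s.prop == prop]


def find(files, prop, predicate):
    """Titles of files with at least one value of `prop` matching `predicate`."""
    return [f.title for f in files if any(predicate(v) for v in f.values(prop))]


# Sample data: identifiers and years are placeholders, not real Commons records.
files = [
    MediaFile("Gallery_of_Jan_Gildemeester.jpg", [
        Statement("depicts", "ITEM_PAINTING_A"),  # a painting inside the painting
        Statement("depicts", "ITEM_PERSON_B"),    # a person shown in the gallery
        Statement("year_created", 1794),
        Statement("year_uploaded", 2012),
    ]),
    MediaFile("Structured_data_bee.png", [
        Statement("depicts", "ITEM_BEE"),
        Statement("year_created", 2017),
        Statement("year_uploaded", 2017),
    ]),
]

# Creation date and upload date are separate fields, so each can be queried.
old_works = find(files, "year_created", lambda year: year < 1800)
recent_uploads = find(files, "year_uploaded", lambda year: year >= 2010)
bees = find(files, "depicts", lambda item: item == "ITEM_BEE")
```

Because a file can hold several `depicts` statements, a painting that contains other paintings can at least be recorded. Deciding which of those statements matters most is the ‘tricky ontology’ part that a sketch like this does not solve.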

When talking to Wikimedians more familiar with the software side of the websites than I am, asking questions such as ‘why can’t we search Commons by date?’ has always led to the response ‘we will be able to do all these things after Structured Data is brought in.’ This gives an idea of how central this initiative is for the entire user experience of Commons and other projects that seek to improve it.

One of these projects is being led by Miriam Redi, a visual research scientist working for the Wikimedia Foundation but based in London. Her Twitter bio notes that she ‘teaches machines to see the invisible’. She presented her work in a recent Wikimedia Research showcase video.

Miriam’s work is separate from the work being done by the SDoC team, and is not funded by the Sloan grant, but has aspects which have an important bearing on the work being done to bring order to the overgrown garden of media on Commons.

Miriam notes in the showcase that as many as 95% of Wikidata items currently don’t have images associated with them, and asks if we can make it easier to automate image selection.

To this end, Miriam is piloting the design of algorithms to discover and recommend freely licensed images that can be added to Wikidata items or Wikipedia pages. To do this, she is using facial recognition software as well as metadata matching, and is also training an image quality algorithm on about 160,000 images from Category:Quality images on Commons. The trick will then be to combine image relevance and quality to discover the best images that can be added to Wikidata items.
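One simple way to picture ‘combining relevance and quality’ is a weighted average of two scores, followed by ranking. The sketch below assumes that each candidate image already has a relevance score and a quality score between 0 and 1. The weighting, the function names and the sample scores are assumptions for illustration, not a description of Miriam’s actual method.

```python
def combined_score(relevance, quality, weight=0.7):
    """Weighted average of relevance and quality; `weight` favours relevance."""
    return weight * relevance + (1 - weight) * quality


def recommend(candidates, top_n=1, weight=0.7):
    """Return the file names of the `top_n` best-scoring candidates."""
    ranked = sorted(
        candidates,
        key=lambda c: combined_score(c["relevance"], c["quality"], weight),
        reverse=True,
    )
    return [c["file"] for c in ranked[:top_n]]


# Made-up candidates for a single Wikidata item.
candidates = [
    {"file": "A.jpg", "relevance": 0.9, "quality": 0.3},  # on topic, poor photo
    {"file": "B.jpg", "relevance": 0.8, "quality": 0.9},  # on topic, good photo
    {"file": "C.jpg", "relevance": 0.2, "quality": 1.0},  # lovely, but off topic
]

best = recommend(candidates)
top_two = recommend(candidates, top_n=2)
```

With relevance weighted at 0.7, a beautiful but off-topic image cannot outrank a relevant one, while quality still decides between images that are similarly relevant.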

Miriam’s research complements work done by Magnus Manske, who has created many tools which do similar things to link databases and match images to Wikidata items.

I talked with Miriam recently about the idea of creating a simple mobile app which showed the user images with suggested ‘fields’, allowing the user to then swipe left or right to say whether or not the image should be tagged with the suggested category. This would allow the community to help organise the uncategorised images on Commons much more efficiently.
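As a rough sketch of how such an app might turn swipes into decisions, the code below accepts or rejects a suggested tag only once enough people have voted and a clear majority agree. The thresholds and the function name are hypothetical; no such app exists in the SDoC plans described here.

```python
def decide(votes, min_votes=3, threshold=0.8):
    """Decide on a suggested tag from a list of swipes (True = right, False = left).

    Returns "accept", "reject" or "undecided".
    """
    total = len(votes)
    if total < min_votes:
        return "undecided"  # not enough swipes yet
    yes = sum(1 for v in votes if v)
    if yes / total >= threshold:
        return "accept"
    if (total - yes) / total >= threshold:
        return "reject"
    return "undecided"  # the community is split; keep asking


unanimous = decide([True, True, True, True, True])
rejected = decide([False, False, False, False])
split = decide([True, False, True])
too_few = decide([True, True])
```

Requiring agreement from several people, rather than trusting a single swipe, would be one way to keep accidental or careless swipes from adding wrong tags.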

Integrating algorithms in the curation of media files on Commons is currently not on the SDoC project roadmap and may have to wait for more progress on the infrastructure and curation interfaces. When I talked to Creative Commons director Ryan Merkley last year about improvements to the CC search, I discovered that Creative Commons is also waiting for progress on SDoC before they start indexing the images on Commons, because it is hard to index them properly without improvements to the structure of the metadata on Commons.

Of course, nothing happens as quickly as we would like it to in terms of improving the user experience of Wikimedia projects. However, it’s encouraging that there are people working to improve the way that Commons functions, because it is such an important tool which could be even more useful if it were easier to use.
