By Magnus Manske, MediaWiki developer and longtime Wikimedian
The new year is just over two weeks old, but the WikiVerse has already celebrated a joyous event: Wikipedia’s 18th birthday! 50 million seems to be the number of the day – 50M articles across Wikipedia editions, 50M files on Commons, 50M items on Wikidata. But all this free content does not appear in a few big strokes; it comes from millions of uploads and edits. Your work, our work, builds this vast repository of knowledge, one small action at a time. I would like to take this opportunity to share with you a few of those actions I have been involved with in these first few days of the year.
My hope is to inspire you to look at new areas of our project, to take a leap and follow an interesting tangent, but above all, to remember that every edit, every cited reference, every vandalism revert adds to the Sum Of All Knowledge, and that it will be valuable to someone, some day, some place.
Images
Sometimes, knowledge is already present in our project; it is just cleverly hidden, begging to be released. For example, Wikidata has many items about people, some of them with an image. Wikidata also has items about paintings, and some of these have an image as well, but they might not have a “depicts” statement.
But if the image of the painting is the same as the one used for a person, it is likely (though not guaranteed!) that the person depicted in the painting is that person. A simple SPARQL query shows us about a thousand such item pairs. And even if the image is not of the person (for example, sometimes a painting by a painter sneaks into the item as a painting of the painter), it can be an opportunity to remove a wrong image from the item about the person.
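For the curious, here is a minimal sketch of what such a query could look like; it is not necessarily the exact query behind the list above, and it uses the usual Wikidata IDs (P18 for image, P31/Q5 for humans, Q3305213 for paintings, P180 for depicts), wrapped in a small Python call to the query service:

```python
# Sketch only: person items and painting items sharing the same image (P18),
# where the painting has no "depicts" (P180) statement yet.
import requests

QUERY = """
SELECT ?person ?painting ?image WHERE {
  ?person   wdt:P31 wd:Q5 ;        # instance of: human
            wdt:P18 ?image .       # image
  ?painting wdt:P31 wd:Q3305213 ;  # instance of: painting
            wdt:P18 ?image .       # the very same image file
  FILTER NOT EXISTS { ?painting wdt:P180 [] }   # painting lacks "depicts"
}
"""

r = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "example-script/0.1 (sketch)"},
)
for row in r.json()["results"]["bindings"]:
    print(row["person"]["value"], row["painting"]["value"], row["image"]["value"])
```

Depending on the day, a query like this may need a LIMIT or extra constraints to stay within the query service’s time limit, and each pair still needs a human to confirm who is actually depicted.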
Similarly, over 1700 Wikidata items use an image of a church that another “church item” uses as well, often revealing either a wrong image or a duplicate item.
User:Christoph_Braun has used my Flickr2Commons tool to upload over a thousand historical images, released under a free license by the Linz am Rhein city archive, to Commons. You can help put these pictures to good use by finding Wikidata items (and, by extension, often Wikipedia pages) without an image, by coordinates or by category. If you want to add free images to Wikidata items but don’t want to go hunting for them, the FileCandidates tool has hundreds of thousands of prepared possible image-to-item matches waiting for you. And if you would like to add more “depicts” statements to items, topicMatcher is there for you (also offering “main subject” and “named after”).
Mix’n’Match
Mix’n’Match is one of my more popular tools, especially with authority control data fans. It recently passed 50 million entries, most of which are waiting to be matched to a Wikidata item. To cut down on the number of entries that need the “human touch” to be matched, I have various helper scripts running in the background that automatically match entries to items when it is reasonably safe to do so.
One of these helper scripts uses the name, birth year, and death year of biographical entries to find a match on Wikidata. Since entries are imported from many different sources, their metadata (such as birth/death dates) is not stored in any standardized way. I had already written bespoke code to extract such dates from the entry descriptions of several catalogs, but this year I sat down and systematically checked all ~2000 catalogs for date information in their entries, extracting it where possible. Because the dates range from plain years and ISO format to free-text French, every single catalog with dates needs its own extraction code. This is now complete, as of a few days ago; initial runs led to over ten thousand new matches with Wikidata. Of course, all those matches are turned into Wikidata statements as well, where the catalog has an associated property.
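To give you an idea of the kind of per-catalog code this involves, here is a much-simplified sketch (this is not the actual Mix’n’Match code, and the catalog numbers are made up):

```python
# Simplified sketch of per-catalog date extraction; each catalog gets its own
# parser because the date formats differ wildly between sources.
import re

FRENCH_MONTHS = {"janvier": 1, "février": 2, "mars": 3, "avril": 4, "mai": 5,
                 "juin": 6, "juillet": 7, "août": 8, "septembre": 9,
                 "octobre": 10, "novembre": 11, "décembre": 12}

def parse_plain_years(description):
    """Catalogs with descriptions like '... (1823-1891) ...'."""
    m = re.search(r"\((\d{4})\s*[-–]\s*(\d{4})\)", description)
    return (m.group(1), m.group(2)) if m else None

def parse_iso_dates(description):
    """Catalogs with ISO dates like '1823-05-02 – 1891-11-10'."""
    dates = re.findall(r"\d{4}-\d{2}-\d{2}", description)
    return (dates[0], dates[1]) if len(dates) >= 2 else None

def parse_french_freetext(description):
    """Catalogs with free-text French, e.g. 'né le 2 mai 1823, mort le 10 novembre 1891'."""
    dates = []
    for day, month, year in re.findall(r"(\d{1,2})\s+(\w+)\s+(\d{4})", description):
        if month.lower() in FRENCH_MONTHS:
            dates.append(f"{year}-{FRENCH_MONTHS[month.lower()]:02d}-{int(day):02d}")
    return tuple(dates[:2]) if len(dates) >= 2 else None

# One parser per catalog; the catalog IDs here are purely hypothetical.
CATALOG_PARSERS = {101: parse_plain_years, 102: parse_iso_dates, 103: parse_french_freetext}

print(parse_french_freetext("né le 2 mai 1823, mort le 10 novembre 1891"))
# -> ('1823-05-02', '1891-11-10')
```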
In a similar fashion, I have code to extract third-party identifiers (e.g. VIAF) from the descriptions or web pages of entries. These can then be used to match entries to items, or to add the identifiers as statements to an already matched item. Matching on such identifiers requires them to be present on Wikidata, so adding such statements on Wikidata proper directly helps Mix’n’Match (and everyone, really). If you want to give it a try, this list has over 1000 items that likely have a GND (and probably VIAF) identifier that is missing from Wikidata.
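The identifier extraction follows the same pattern; again, a much-simplified sketch rather than the real code:

```python
# Sketch: pull a VIAF identifier out of an entry description or web page,
# whether it appears as "VIAF: 12345678" or as a viaf.org link.
import re

def extract_viaf(text):
    m = re.search(r"viaf(?:\.org/viaf/|[:\s]+)(\d{2,22})", text, re.IGNORECASE)
    return m.group(1) if m else None

print(extract_viaf("See also VIAF: 12345678"))          # -> 12345678
print(extract_viaf("https://viaf.org/viaf/12345678"))   # -> 12345678
```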
Some catalogs are easier to match to Wikidata than others. Entries with ambiguous names and no description are hard; biographical entries with a description, birth, and death date are much better. Taxonomic entries with Latin species names are the easiest, as we have a Wikidata property for those, and plenty of species to match to; automated matching usually gets >90% of these. However, this new catalog about fossil plants has fewer than 3% matches. A new area to be imported and curated on Wikidata!
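The taxonomic case is simple enough to sketch: look the Latin name up against the “taxon name” property (P225), and only auto-match when exactly one item carries it. This illustrates the idea, it is not the actual matcher:

```python
# Sketch: match a Latin species name to a Wikidata item via "taxon name" (P225).
import requests

def find_taxon_item(latin_name):
    query = 'SELECT ?item WHERE { ?item wdt:P225 "%s" . }' % latin_name
    r = requests.get("https://query.wikidata.org/sparql",
                     params={"query": query, "format": "json"},
                     headers={"User-Agent": "example-script/0.1 (sketch)"})
    hits = [b["item"]["value"] for b in r.json()["results"]["bindings"]]
    # Only auto-match when the name is unambiguous (exactly one candidate item).
    return hits[0] if len(hits) == 1 else None

print(find_taxon_item("Ginkgo biloba"))
```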
Matching Mix’n’Match entries to items helps Wikidata only if there is a property associated with the Mix’n’Match catalog. Likewise, it is helpful to link from a property to the associated Mix’n’Match catalog(s). I have created a new status page that shows missing links and inconsistencies. This complements my reports on individual catalogs. All of these reports are updated regularly.
This and that
Most Wikipedia articles have an associated Wikidata item. However, newly created articles are often not immediately linked to Wikidata via a new or an existing item. These “Wikidata-orphaned” articles can be found “by wiki”, for example for English Wikipedia (one way to find them is sketched below). It is a constant battle to prune that list manually, even with a game for that purpose. The number of such orphaned articles shows a curious pattern of “mass-matching” and slow build-up. Some investigation shows that a Wikidata user regularly creates new items for all orphaned articles, across many wikis. While this links the articles to Wikidata, it potentially creates a lot of duplicate items. Worse, since these items are blank (apart from the site link to the article and a title), automated duplicate detection is hard.
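As for finding the orphaned articles, one possible approach (a sketch, not necessarily what the tool does) is to ask the Toolforge database replicas for main-namespace, non-redirect pages without a wikibase_item page property; the host name and credentials file below are placeholders:

```python
# Sketch: list articles on a wiki that have no Wikidata item linked to them.
import pymysql

SQL = """
SELECT page_title
FROM page
LEFT JOIN page_props
       ON pp_page = page_id AND pp_propname = 'wikibase_item'
WHERE page_namespace = 0
  AND page_is_redirect = 0
  AND pp_page IS NULL
LIMIT 100
"""

conn = pymysql.connect(
    host="enwiki.analytics.db.svc.wikimedia.cloud",  # replica host name (assumption)
    db="enwiki_p",
    read_default_file="~/replica.my.cnf",            # Toolforge credentials file
)
with conn.cursor() as cur:
    cur.execute(SQL)
    for (title,) in cur.fetchall():
        print(title.decode("utf-8"))
```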
To get a handle on the issue, I found all blank items created by that user in that fashion.
That list amounts to over 745K (yes, 3/4 of a million) blank items. For convenience, I have created a PagePile for them. Please do note that this is a “snapshot”, so some of these items will receive statements, or be merged with other items, over time.
Structured Data is coming to Commons!
For starters, there are multi-lingual file descriptions, but statements should follow during the course of this year. Since this is using Wikibase (the same technology underlying Wikidata), it will use (more or less) the same API. I have now prepared QuickStatements to run on Commons; however, the API on Commons is not quite ready yet. Once the API is functional, you should be able to edit Commons MediaInfo data via QuickStatements, just as you can edit Wikidata items now.
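For those curious what that will look like: MediaInfo entities have IDs of the form “M” plus the file’s page ID, and should be accessible through the familiar Wikibase API modules. A minimal read-only sketch, assuming the standard wbgetentities module behaves as it does on Wikidata (the entity ID here is just a placeholder):

```python
# Sketch: fetch the MediaInfo entity for a Commons file via the Wikibase API.
import requests

r = requests.get(
    "https://commons.wikimedia.org/w/api.php",
    params={
        "action": "wbgetentities",
        "ids": "M12345678",   # placeholder MediaInfo ID ("M" + page ID of the file)
        "format": "json",
    },
)
print(r.json())
```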
I had a few reports of PetScan dying on certain queries. It turns out that using a huge category tree (say, >30K sub-categories) will cause the MySQL server to shrug, taking PetScan down with it. I have rewritten some of the PetScan code to run such a query in several smaller chunks instead. It seems to work well, but please report any strange results to me.
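The idea behind the fix is simply to split the sub-category list into manageable chunks, query each chunk separately, and merge the results; a rough sketch of the approach (PetScan itself is not written in Python):

```python
# Sketch: run one big category query as several smaller ones and merge results.
def chunked(items, size):
    for i in range(0, len(items), size):
        yield items[i:i + size]

def pages_in_categories(all_subcategories, query_chunk, chunk_size=5000):
    """query_chunk(subcats) is assumed to return the set of page titles found
    in the given (smaller) list of sub-categories."""
    result = set()
    for chunk in chunked(all_subcategories, chunk_size):
        result |= query_chunk(chunk)   # each chunk stays small enough for MySQL
    return result
```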
I hope this little tour has given you some ideas or motivation for work on our project. Happy new year everyone, and may your edits not be reverted!