Data on the history of Scottish witch trials added to Wikidata

North Berwick Witches – the logo for the Survey of Scottish Witchcraft database (Public Domain, via Wikimedia Commons)

By Ewan McAndrew, Wikimedian in Residence at the University of Edinburgh

The first Wikidata in the Classroom assignment at the University of Edinburgh took place last semester on the Data Science for Design MSc course. Two groups of students worked on a project to import the Survey of Scottish Witchcraft database into Wikidata to see what possibilities surfacing this data as structured linked open data could achieve.

Meeting the information & data literacy needs of our students

The Edinburgh and South East Scotland City Region has recently secured a £1.1bn City Region deal from the UK and Scottish Governments. Out of this amount, the University of Edinburgh will receive in the region of £300 million towards making Edinburgh the ‘data capital of Europe’ through developing data-driven innovation. Data “has the potential to transform public and private organisations and drive developments that improve lives.” More specifically, the university is being trusted with the responsibility of delivering a data-literate workforce of 100,000 young people over the next ten years; a workforce equipped with the data skills necessary to meet the needs of Scotland’s growing digital economy.

The implementation of Wikidata in the curriculum therefore presents a massive opportunity for educators, researchers and data scientists alike; not least in honouring the university’s commitment to the “creating, curating & dissemination of open knowledge”. A Wikidata assignment allows students to develop their understanding of, and engagement with, issues such as data completeness, data ethics, digital provenance, data analysis and data processing, as well as making practical use of a raft of tools and data visualisations. Because Wikidata is linked open data, students can also help connect to and leverage a variety of other datasets in multiple languages, helping to fuel discovery through exploring the direct and indirect relationships at play in this semantic web of knowledge. This real-world application of teaching and learning enables insights in a variety of disciplines, be it open science, digital humanities, cultural heritage, open government and much more besides. Wikidata is also a community-driven project, so students can work collaboratively and develop the online citizenship skills necessary in today’s digital economy.

At the Data Science for Design MSc’s “Data Fair” on 26th October 2017, researchers from across the university presented the 45 masters students in Design Informatics with approximately 13 datasets to choose from to work on in groups of three. Happily, two groups were enthused to import the university’s Survey of Scottish Witchcraft database into Wikidata (the choice of database to propose was suggested by a colleague).

This fabulous resource began life in the 1990s before being realised in 2001–2003. Its aim was to collect, collate and record all known information about accused witches and witchcraft belief in early modern Scotland (from 1563 to 1736) in a Microsoft Access database, and to create a web-based user interface for that database. Since 2003 the data has remained static in the Access database, so students at the 2017 Data Fair were invited to consider what could be done if the data were exported into Wikidata, given multilingual labels and linked to other datasets. Beyond this, what new insights and visualisations of the data could be achieved?

Packed house at the Data Fair for the Data Science for Design MSc course – 26 October 2017 (Ewan McAndrew, CC-BY-SA)

The methodology

A similar methodology to that used for managing Wikipedia assignments was employed, making the transition from managing a Wikipedia assignment to managing a Wikidata assignment an easy one. The two groups of students underwent a 1.5-hour practical induction on working with Wikidata and third-party applications such as Histropedia (“the timeline of everything”) before being introduced to the Access database. They then discussed collaboratively how best to divide the task of analysing and exporting the data, deciding that one group would work on (1) importing records for the 3,219 accused witches while the other group would work on (2) the import of the witch trial records and (3) the people associated with these trials (lairds, judges, ministers, prosecutors, witnesses etc).

The groups researched and submitted their data models for review. Once their models had been checked and agreed upon, the students were ready to process the data from the Access database into a format Wikidata could import (making use of the handy Wikidata plug-in for Google Spreadsheets). Upon completion of the import, the students could then choose how to visualise this newly added data in a number of ways, such as maps, timelines, graphs, bubble charts and more. The students finished their project by showcasing their insights and data visualisations in various media at the end-of-project presentation day on 30 November 2017.
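For readers curious about what such an import can look like in practice, here is a minimal, hypothetical sketch of one possible route: converting rows exported from the Access database into QuickStatements commands for bulk item creation. The students actually used the Google Spreadsheets plug-in mentioned above, and the CSV column names here are invented for illustration.

```python
# Hypothetical sketch only: turn rows exported from the Survey of Scottish
# Witchcraft Access database into QuickStatements (v1) commands.
# The CSV column names are invented for illustration.
import csv

def rows_to_quickstatements(csv_path):
    commands = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            commands.append("CREATE")
            # English label and description for the new item
            commands.append(f'LAST\tLen\t"{row["name"]}"')
            commands.append('LAST\tDen\t"person accused of witchcraft in early modern Scotland"')
            # P31 = instance of, Q5 = human
            commands.append("LAST\tP31\tQ5")
    return "\n".join(commands)

if __name__ == "__main__":
    print(rows_to_quickstatements("accused_witches.csv"))
```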

Here are two data visualisation videos they produced:


The way forward

We now have 3,219 items of data on the accused witches in Wikidata (spanning 1563 to 1736). We also now have data on 2,356 individuals involved in trying these accused witches. Finally, we have the 3,210 witch trials themselves. This means we can link and enrich the data further by adding location data, dates, occupations, places of residence, social class, marriages, and penalties arising from the trials.
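As a rough illustration of how this data can now be queried, here is a minimal Python sketch that counts the accused-witch items via the Wikidata Query Service. It assumes the items carry the Survey of Scottish Witchcraft accused-witch identifier property (P4478 at the time of writing – check the property before relying on it).

```python
# Minimal sketch: count accused-witch items in Wikidata via the SPARQL endpoint.
import requests

QUERY = """
SELECT (COUNT(?person) AS ?accused) WHERE {
  ?person wdt:P4478 ?surveyId .   # Survey of Scottish Witchcraft accused-witch ID (assumed)
}
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "witch-trials-example/0.1"},
)
resp.raise_for_status()
print(resp.json()["results"]["bindings"][0]["accused"]["value"])
```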

The hope is that this project will aid the students’ understanding of data literacy through the practical application of working with a real-world dataset, and help shed new light on a little-understood period of Scottish history. This, in turn, may help fuel discoveries by dint of surfacing this data and linking it with other related datasets across the UK, across Europe and beyond. As the Survey of Scottish Witchcraft’s website itself states: “Our list of people involved in the prosecution of witchcraft suspects can now be used as the basis for further inquiry and research.”

The power of linked open data to share knowledge between different institutions, between geographically and culturally separated societies, and between languages is a beautiful thing. Here’s to many more Wikidata in the Classroom assignments.

Structured Data on Commons is the most important development in Wikimedia’s usability

The Art Gallery of Jan Gildemeester Jansz by Adriaan de Lelie (1794/5) – a problematic image from a structured data perspective, containing multiple other paintings and persons with Wikidata items

By John Lubbock, Communications Coordinator

Wikimedia Commons is one of the most important sources of Creative Commons licensed media online. But it has a huge problem: it’s a total mess.

One of the biggest complaints about Commons which discourages people from using it is that it’s incredibly hard to find what you’re looking for. People upload files and fill out some information about their content, but not in a particularly systematic or structured way, with minimal automated assistance.

This means that when you try to search for something on Commons, you’ll be presented with a huge array of files with minimal relevance to your search. You can’t even really search for media by date. Which date would you be searching for anyway? The date the media was created, the date it was modified or the date it was uploaded? Without systematic organisation of media on Commons, it will continue to represent a disorderly and overgrown garden that desperately needs work to give it some order.

This is where the Structured Data on Commons (SDoC) project comes in. Funded by a $3m grant from the Alfred P. Sloan Foundation, the project is running from 2017 to 2019, and will hopefully lead to big changes in the way Commons is structured and used.

A ‘structured data bee’ by Sandra Fauconnier – image by Sandra Fauconnier CC0

Adam Baso, Engineering Director at the Wikimedia Foundation, says that ‘This current quarter involves the beginnings of search work, although the specific UX is a work in progress.’ Program lead Amanda Bittaker added that they are ‘starting to dig into specifications and designs for search this quarter (while also designing and prototyping for upload wizard and the file page.)  If all goes according to plan, we should have designs to share next quarter.’

SDoC is somewhat like an attempt to make Commons more like Wikidata – to introduce fields rather than categories which would be searchable in any language. This will structure and link data on the files hosted on Commons, which will hopefully make it more user friendly and usable. You can see the roadmap the SDoC team have laid out for the development of the project here.

These ‘fields’ will work like statements on Wikidata, but many media files have what Amanda Bittaker calls ‘tricky ontologies’ – that is, how entities should be ‘grouped, related within a hierarchy, and subdivided according to similarities and differences’, according to the Wikipedia article on Ontology. They have compiled an interesting collection of Commons files which illustrate examples of media that raise interesting questions for the structuring of data on Commons. Files with names written in non-Latin scripts, files with multiple levels of copyright or authorship, paintings containing other paintings, abstract depictions, unknown subjects, and some maps and logos all create problems for organising the data.

When talking to Wikimedians more familiar with the software side of the websites than I am, asking questions such as ‘why can’t we search Commons by date?’ has always led to the response ‘we will be able to do all these things after Structured Data is brought in.’ This gives an idea of how central this initiative is for the entire user experience of Commons and other projects that seek to improve it.

One of these projects is being led by Miriam Redi, a visual research scientist working for the Wikimedia Foundation but based in London. Her Twitter bio notes that she ‘teaches machines to see the invisible’. You can check out her presentation about her work in a recent Wikimedia Research showcase video here:

Miriam’s work is separate from the work being done by the SDoC team, and is not funded by the Sloan grant, but has aspects which have an important bearing on the work being done to bring order to the overgrown garden of media on Commons.

Miriam notes in the showcase that as many as 95% of Wikidata items currently don’t have images associated with them, and asks if we can make it easier to automate image selection.

To this end, Miriam is piloting the design of algorithms to discover and recommend freely licensed images that can be added to Wikidata items or Wikipedia pages. To do this, she is using facial recognition software as well as metadata matching, and is also training an image quality algorithm on about 160,000 images from the Category:Quality images from Commons. The trick will then be to combine image relevance and quality to discover the best images that can be added to Wikidata items.

Miriam’s research complements work done by Magnus Manske, who has created many tools which do similar things to link databases and match images to Wikidata items.

I talked with Miriam recently about the idea of creating a simple mobile app which showed the user images with suggested ‘fields’, allowing the user to then swipe left or right to say whether or not the image should be tagged with the suggested category. This would allow the community to help organise the uncategorised images on Commons much more efficiently.

Integrating algorithms in the curation of media files on Commons is currently not on the SDoC project roadmap and may have to wait for more progress on the infrastructure and curation interfaces. When I talked to Creative Commons CEO Ryan Merkley last year about improvements to the CC search, I discovered that Creative Commons is also waiting for progress on SDoC before they start indexing the images on Commons, because it is hard to index them properly without improvements to the structure of the metadata on Commons.

Of course, nothing happens as quickly as we would like it to in terms of improving the user experience of Wikimedia projects. However, it’s encouraging that there are people working to improve the way that Commons functions, because it is such an important tool which could be even more useful if it was easier to use.

Celtic Knot 2018

Celtic Knot Conference, Edinburgh 2017 – image by Llywelyn2000 CC BY-SA 4.0

The Wikipedia language conference is back, and it’s coming to Wales.

By Jason Evans, National Wikimedian for Wales

Celtic Knot provides a stage and a meeting place for contributors to small and minority language Wikipedias from all around the world. You can now submit a paper and register for this year’s conference, which will be held at the National Library of Wales in the Welsh seaside town of Aberystwyth.

In 2017 Ewan McAndrew, the Wikimedian in Residence at the University of Edinburgh, launched the first Celtic Knot conference. As the name suggests, the original focus of the conference was the six Celtic language Wikipedias, including Welsh, Gaelic, Breton, Cornish and Irish. And whilst Celtic languages were well represented, the conference also attracted delegates from Norway, Catalonia and the Basque Country, to name but a few.

The issues, challenges and opportunities facing smaller language communities on Wikipedia are often very different to those facing larger Wikipedias such as English, which has a large contributor base, a technical support community and 5.5 million articles. There are also advantages: smaller wikis can move quickly and implement new technology, as has been the case with Wikidata. Celtic Knot acts as a meeting point for these smaller communities in order to coordinate and develop ideas common to smaller wikis.

Aberystwyth shore – image by Gjt6 CC BY-SA 3.0

Last year there were discussions about the Sami (davvisámegiella) Wikipedia, which has just 7,000 articles and a handful of editors, and the Cornish (Kernowek) Wikipedia, with 3,800 articles and fewer than 1,000 native speakers. How can we engage the speakers of these languages with their Wikipedias? And what would be the benefits and impacts of doing this successfully?

At the other end of the spectrum are the hugely successful Basque and Catalan Wikipedias, with nearly 1 million articles between them. What drives their success? And what can smaller communities learn from them? The Welsh Wikipedia is being taught in schools and is being supported by the Welsh Government – can this good practice be replicated elsewhere successfully?

Wikidata, the wiki community, bilingualism and education will all be huge themes at this year’s conference as we work together to grow the diversity of Wikipedia and to encourage the use of local languages in a digital environment.

The conference will be opened by the Welsh Assembly’s Minister for Welsh Language and Lifelong Learning, Eluned Morgan, and we expect to announce more exciting speakers to the programme over the coming months.

With limited tickets available, book early to ensure your spot at the conference!

Wikipedia and journalism: researching a Fake News site with Wikipedia as a starting point

By John Lubbock, Communications Coordinator

I saw a friend on Facebook share an obvious piece of fake news a few days ago. It came from a site I’d seen before, World News Daily Report.

I looked at the Wikipedia page for the site, which said it was

a fake news, satirical, purportedly American Jewish Zionist newspaper based in Tel Aviv, Israel, dedicated to covering news and events from around the world in, by its own account, a completely ludicrous manner.

It also stated that the website’s owner was called Janick Murray-Hall, which it evidenced with a Buzzfeed article. Obviously the site’s own description of itself was a red herring, perhaps with some odd political or racist undertones.

‘Janick Murray-Hall runs World News Daily Report, a fake news site that scored five hits in the top 50 thanks to fake political headlines such as, “ISIS Leader Calls for American Muslim Voters to Support Hillary Clinton” and crime hoaxes like, “Morgue Worker Arrested After Giving Birth To A Dead Man’s Baby.”’

The top hit on Google when searching for Mr Murray-Hall is an article about another site he runs, the Journal de Mourréal, a site which attempts to look like and mock the real Journal de Montréal. This article mentions that Murray-Hall also uses the alias Bob Flanagan, a name so common that you won’t turn up anything particularly useful by searching for it.

Another article I discovered, written in French by someone who has been investigating fake news, included an interview with Olivier Legault, a friend of Murray-Hall and co-founder of JdM and WNDR. This provided another good reference with which to populate the WNDR Wikipedia page.

The author of this article criticises Canadian ‘humorists’ who ‘were tearing their shirts to defend the Journal de Mourréal, their right to satire and, in a roundabout way, freedom of expression, [while] our two friends were busy polluting the English-language web with lies and hoaxes. Their other site is called the World News Daily Report (WNDR). There is no humor or satire here. This is pure false news, misinformation.’

The Radio Canada interviewer told Legault that he could not see the humour in the stories he made for WNDR. Legault’s response was vague and evasive.

‘”It depends … Sometimes, the idea is to think about an issue,” he says.

He hesitates a little.

“In a sense, there is a purpose in all this, to take self-criticism to people,” he finally said.’

When pressed that many people will not think hard enough to spot even obvious hoaxes, Legault says that if people want to believe something, it’s their fault essentially.

“The people who take it seriously are people who want to take it seriously. It’s stupid to say, but … We preach to converts. The majority of people who share it understand that it’s a joke, and others share it because they want to believe it, not because they really believe in it”.

Legault then admits that he finds it ‘a little disturbing’ how people will believe anything, saying:

“You can invent everything and anything and people will believe it. Honestly, it’s a little disturbing when you realise that. As long as you confirm what they want to believe, they will share it. If you go against their opinion, they will immediately think that this is false news. But if you go in the direction of their opinion, they will share it right away. They lose their critical spirit.”

One really big problem with exploiting people’s stereotypical or lazy assumptions is that many of them are based on racist, xenophobic or misogynistic ideas about the world. Looking on WNDR right now gives all manner of stories that might appeal to people’s bigoted beliefs about how the world works, like ‘FORMER ISIS SOLDIER EMASCULATED BY GOAT SEEKS REFUGEE STATUS IN CANADA’, ‘MUSLIM MAN SAYS BACON MIRACULOUSLY CURED HIM OF HOMOSEXUALITY’, or ‘IRISH FARMER CLAIMS HE WAS SEXUALLY ASSAULTED BY A LEPRECHAUN’. Imagine finding out your stock photo is now the Irish farmer who says he’s been molested by a leprechaun.

The Yellow Press – image from the Library of Congress, public domain

Legault concludes by saying that even journalists have repeated his stories as factual, and he keeps a copy of a Quebec magazine that reprinted one.

So does Legault really believe that he’s encouraging people’s self-criticism, or is this just an obvious excuse to hide that he’s doing the exact opposite: exploiting people’s lack of it?

Whatever the case, these sources should go into improving the Wikipedia page for WNDR, to improve the resource and allow people to more easily identify the site as a place dedicated to exploiting the confirmation biases of people who don’t think too hard about the news they consume.

The aesthetic and stories in WNDR remind me somewhat of ‘Weekly World News’, a paranormal news publication that was featured in the Bradford Science and Media Museum’s Fake News exhibition, which I went to recently to interview curator John O’Shea. This publication also walked a fine line, exploiting people’s beliefs (though more their belief in the supernatural than their political prejudices) to make money, while also sometimes acknowledging that it was a load of rubbish.

I’ve been reading Daniel Kahneman’s book on cognitive biases, Thinking, Fast and Slow, and it makes it obvious to me that we need to get over our belief that humans are rational actors, and that we are patronising them if we say they are too easily exploited. People’s brains are lazy, and operate on a thousand different rules of thumb we employ to reduce our mental labour. Fake News, like many other psychological techniques, knows very well that it is exploiting this laziness, and it doesn’t care, because it’s making money out of it.

This is dangerous, both to society and to political systems, and we need to arm people with the tools to see when they are being used and exploited for political or financial gain.

The way I used Wikipedia as a starting point for research into this topic helped me to find pieces of information which were missing from the page, which acted as a repository for the new information I discovered through searching online. The whole exercise created a workflow, where I began and ended at the Wikipedia page, going to do my own research and discovering new things in between, and then returning to make them easily available for others on the page.

This kind of workflow is, I believe, one that journalism students should practise to teach themselves how to do research online, and it provides them with an immediate place to publish and collate the new information they discover. I see no reason why people studying journalism at university should not practise this kind of exercise as an important way to understand Wikipedia and improve their research and writing skills.

If you’re a journalist and you come across information that is missing from Wikipedia, you can do a useful service by incorporating references and new information into articles like this one. Real journalism should improve people’s ability to tell truth from fiction, and that is exactly what Wikipedia tries to do, so I personally think that they complement each other in important ways.

#1Lib1Ref in Scotland

SLIC #1Lib1Ref Team – image by Morag Wells CC BY-SA 4.0

Glasgow Women’s Library; the Scottish Poetry Library; the Scottish Storytelling Centre; James Kelman’s classic How Late It Was, How Late; and Helen Cruickshank – all Wikipedia articles that were improved during the course of SLIC’s #1Lib1Ref activity.  

In case you missed the hubbub around the campaign, #1Lib1Ref is the annual drive to get more librarians and library staff engaged with Wikipedia – specifically, to add just one citation to the encyclopedia during the period 15 Jan – 3 February, tagging their edits with #1Lib1Ref.   

Librarians are natural allies of the open knowledge movement, and this campaign has been designed to provide an easy introduction.  At SLIC I’m working to get public library services across Scotland engaged with Wikimedia, and to shine a light on some of the amazing collections held in our libraries.

Staff training at Glasgow Caledonian University Library – image by Sara Thomas CC BY-SA 4.0

We’re currently in the first phase of our project here, working with Inverclyde, North Lanarkshire and North Ayrshire library services to run staff awareness and training, leading hopefully to a self-sustaining programme of editathons in those services. They’re just getting started – the date for our first editathon has been set, with the second close on its heels; you can read more in my six-month project report here. Inverclyde also managed during this time to get a bit of #1Lib1Ref activity going!

My own #1Lib1Ref goal was to add a citation a day, and tweet about it… you can see everything I did here – with the most engagement coming from the work I did on the article about Adele Patrick, a co-founder of the Glasgow Women’s Library and a winner of Scotswoman of the Year.

In mid-January we were also very happy to welcome the next generation of library professionals, in the shape of Jenny, a student on placement from the University of Strathclyde’s Information and Library Studies Masters. She’ll be with us until the end of March, has an interest in information literacy, and her first project with us was to take part in #1Lib1Ref – which resulted in the creation of a new article, for the Suffragette Oak.

We had a team day in the office at SLIC just before the end of the campaign, with 6 editors making 39 edits on 10 articles!  And eating pizza, just because.  

Last but not least, an event which took place just a day or so after the campaign finished (so I’m including it here, because quite frankly I can): the Don’t Cite Wikipedia, Write Wikipedia! day with staff from Glasgow Caledonian University Library. 15 editors made 71 edits across six articles, four of them new: Orkney Library and Archive, Shetland Library, Lady Bruce of Clackmannan and Jess Smith (writer).

1Lib1Ref worklist – image by Sara Thomas CC BY-SA 4.0

I’m absolutely delighted to have had the opportunity both to shout a bit louder about some of Scotland’s libraries, librarians and writers, and to talk to library professionals about the sort of thing that we can achieve through engagement with Wikimedia projects. I really hope that next year we can get much more involved. This being my second residency that involves sector-wide advocacy, I know that making changes on this kind of scale can take time – and that’s why campaigns like this, which can show individuals that they’re part of a global movement, are important.

How should journalists use Wikipedia?

44th Munich Security Conference 2008 – Image by Kai Mörk, freely licensed under CC BY 3.0 (Germany).

Wikipedia receives about 18 billion page views per month from around 1.4 billion unique devices. While many people use Wikipedia for basic research into subjects they want a simple understanding of, Wikipedia and its sister projects also offer many features that journalists can usefully make part of their research process.

Traditionally, journalists, like students, have been discouraged from using Wikipedia as part of their research, but this attitude is slowly changing as people realise that while information should not be exclusively sourced from Wikipedia articles, Wikimedia projects (like Wikipedia, Wikimedia Commons and Wikidata) can be powerful tools for journalists.

Some general rules

Citation needed protester – image by futureatlas.com CC BY 2.0

Don’t just accept facts, check the citations. Every fact on Wikipedia should be condensed from another source. Those sources go in the references list at the bottom of the page.

When former Guardian editor Peter Preston died recently, the Guardian repeated a claim in his Wikipedia page that a book he had written, The 51st State, had been made into a film with Samuel L Jackson. There was a film with that name and Jackson in it, but it was not based on Preston’s book.

This is probably the most basic point, but it’s worth reiterating. Asking a journalist friend of mine whether she used Wikipedia, I was told that nobody trusts the information provided on Wikipedia alone because the editing process is open to all. In practice however, many journalists cut corners and still often plagiarise whole sections of articles to save time, even if their managers tell them not to.

Plagiarism also breaks the Creative Commons licences that all information on Wikipedia is shared under, which state that you are allowed to reuse any of the content as long as you attribute the source. So especially if you’re a group of Japanese lawmakers on a tax-funded fact-finding trip to the US, you should avoid copying entire sections of Wikipedia to save time, as it’s very easy for people to find out you did that kind of thing.

Basically, don’t be lazy. Wikipedia’s information is transparent, and allows you to see its provenance. You might also want to check the View History tab at the top of the page to see who has been editing it and when, and look on the Talk tab to see what issues with the page have been considered by its editors.

Familiarise yourself with Wikipedia’s rules and guidelines

Only around 30% of all pages on Wikipedia are the articles themselves. The rest are the talk pages, user pages, policies, WikiProjects, Lists, disambiguation pages, Category pages and so on. The most meta level of Wikipedia’s rules is contained in what we call the Five Pillars of Wikipedia. These are:

  1. Wikipedia is an encyclopaedia. It’s not a soapbox, an advertising platform or a vanity press.
  2. Wikipedia is written from a Neutral Point of View (NPoV).
  3. Wikipedia is free content that anyone can use, edit or distribute. Everything is published under open licences, mainly the Creative Commons Attribution-ShareAlike licence.
  4. Wikipedia’s editors should treat each other with respect and civility.
  5. Wikipedia has no firm rules.

The next most important rules are the Notability Criteria, and the Reliable Sources guidelines, or you can check out the entire List of guidelines.

Understanding sources

One important aspect of media literacy that these guidelines can teach you is that not all sources are created equal. Wikipedia does not accept self-published sources as reliable, such as petitions, blogs and social media posts. It also discourages the use of tabloid news sources where better ones are available.

A decision taken by editors to include the Daily Mail in this list of discouraged sources made headlines last year when it was widely reported as ‘Wikipedia bans the Daily Mail’. Daily Mail sources are not banned on Wikipedia. Thousands still exist, but should be replaced by better ones if they are available.

Wikipedia is what is called a ‘tertiary’ source. It’s not an eyewitness account or opinion (a primary source), or a secondary source, which combines and discusses information first presented elsewhere. It should be a summary of the best available primary and secondary sources about a subject. Tertiary sources are not ‘academic level’ sources, so you shouldn’t cite Wikipedia articles in academic papers. You’re also not allowed to cite one Wikipedia article in another one.

With this in mind, it follows that journalists should use tertiary sources like Wikipedia as background research to understand a topic, and find good primary and secondary sources that can help them dig deeper into the subject they are researching. As Wikipedia co-founder Jimmy Wales said back in 2011,

“Journalists all use Wikipedia. The bad journalist gets in trouble because they use it incorrectly; the good journalist knows it’s a place to get oriented and to find out what questions to ask”.

Science and medicine

Total solar eclipse – image by Michael S Adler

Not all parts of Wikipedia are the same. Much stricter reliable-source guidelines apply in the scientific and medical areas of Wikipedia, where generally only well-known journal articles will be accepted.

I once tried to improve the article on Water Fasting, in response to someone on Twitter saying it wasn’t very good. It’s not very good, but that’s because there have not been any reliable medical studies done on it which have been written about in reliable medical journals, so there’s not much that can be said. My edits to enlarge the page with information from articles in the Huffington Post and other sources all got deleted.

Once again, one of the best ways to find out how Wikipedia works is simply to edit it. This fact is somewhat similar to the commonly referenced idea of Ward Cunningham, inventor of the wiki, who said that “the best way to get the right answer on the internet is not to ask a question; it’s to post the wrong answer.”

This discourse between people with competing views generally helps to make Wikipedia more reliable over time: as articles are edited by more and more people, the information becomes better, as a Harvard Business Review investigation discovered in 2016.

Tools

Because all of Wikipedia’s data is open and searchable by machines, there are lots of interesting things that people are doing to make use of it. Some bots scan the list of edits for vandalism, or for people editing from IP addresses known to belong to the UK Parliament or US Congress. People have also made Twitter accounts that automatically publish these edits on Twitter. Here are a couple of them:

Parliament Edits – https://twitter.com/parliamentedits

Congress Edits – https://twitter.com/congressedits
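For the technically curious, here is a rough sketch of the kind of thing those bots do: poll the recent changes feed for anonymous edits and flag any made from a watched IP range. The network below is a placeholder, not a real parliamentary range.

```python
# Rough sketch: flag recent anonymous edits made from a watched IP range.
import ipaddress
import requests

WATCHED = ipaddress.ip_network("192.0.2.0/24")  # placeholder range (TEST-NET), not a real one

resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={
        "action": "query",
        "list": "recentchanges",
        "rcshow": "anon",               # anonymous (IP) edits only
        "rcprop": "user|title|timestamp|comment",
        "rclimit": "50",
        "format": "json",
    },
    headers={"User-Agent": "edit-watch-example/0.1"},
)
resp.raise_for_status()

for change in resp.json()["query"]["recentchanges"]:
    try:
        ip = ipaddress.ip_address(change["user"])
    except ValueError:
        continue  # not an IP address; skip it
    if ip in WATCHED:
        print(f'{change["timestamp"]}  {change["title"]}  edited from {ip}')
```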

Maybe you want to find out what GPS data exists on Wikipedia and its sister sites in a particular area. Check out the WikiShootMe tool to see Wikipedia articles, photos from Wikimedia Commons and data from Wikidata with GPS coordinates displayed on an OpenStreetMap.

Another useful but not immediately obvious way to see data about Wikipedia pages is the Page View tool. Every Wikipedia page has a link on the left-hand side under Tools that says ‘Page information’. Clicking this link and then scrolling to the bottom of the page, you will find another link called ‘Page view statistics’. This will take you to a tool where you can see how many visits the page had within a time period of your choosing. Using this tool can provide useful insights, such as the fact that the biggest spike in views of the European Union article was on the day following the Brexit referendum.
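The same numbers can also be pulled programmatically from the Wikimedia Pageviews REST API. Here is a small sketch that asks for daily views of the “European Union” article around the June 2016 referendum:

```python
# Sketch: daily page views of "European Union" around the June 2016 Brexit referendum.
import requests

URL = (
    "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
    "en.wikipedia/all-access/all-agents/European_Union/daily/20160620/20160627"
)

resp = requests.get(URL, headers={"User-Agent": "pageview-example/0.1"})
resp.raise_for_status()

for item in resp.json()["items"]:
    # timestamps come back as YYYYMMDDHH strings
    print(item["timestamp"][:8], item["views"])
```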

Wikimedia Commons and Wikidata

When I was a freelance journalist, one of the most useful aspects of the Wikimedia projects was the free image database, Wikimedia Commons. All of Wikipedia’s media files are hosted there, under free licences (mostly Creative Commons attribution licences), meaning that you can reuse or modify them in any way you like, as long as you credit the original author of the media. Look in the details below any image to see the creator or user who uploaded the file. Your credit should look like this: ‘[Image name] by [username], CC BY-SA 4.0, Wikimedia Commons’.

To search Commons for images, try typing Category: into the search box, followed by a type of image you’re looking for. You can then search for images tagged with this category. Alternatively, try looking through the list of Featured images or quality images for something to use.
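If you prefer to script it, the Commons API can list the files in a category directly. A minimal sketch (the category name is just an example):

```python
# Sketch: list the first few files in a Commons category via the API.
import requests

resp = requests.get(
    "https://commons.wikimedia.org/w/api.php",
    params={
        "action": "query",
        "list": "categorymembers",
        "cmtitle": "Category:Quality images",
        "cmtype": "file",
        "cmlimit": "20",
        "format": "json",
    },
    headers={"User-Agent": "commons-category-example/0.1"},
)
resp.raise_for_status()

for member in resp.json()["query"]["categorymembers"]:
    print(member["title"])
```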

Searching Commons categories.

Wikipedia and its sister projects are huge and complicated, and there are so many ways that they could be used by journalists that there are probably many things missing from this list. If you’re a data journalist, for example, and want to do complex data queries, I would suggest learning how to use Wikidata. You can find tutorials online like this one or this one to show you how to use the Query Service to search the data.

https://www.youtube.com/watch?v=1jHoUkj_mKw
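To give a small taste of what the Query Service can do from code, here is a hedged sketch that asks for countries and their populations, largest first (the property and item IDs are current Wikidata identifiers at the time of writing):

```python
# Minimal example of a Wikidata Query Service call from Python.
import requests

QUERY = """
SELECT ?countryLabel ?population WHERE {
  ?country wdt:P31 wd:Q6256 ;       # instance of: country
           wdt:P1082 ?population .  # population
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY DESC(?population)
LIMIT 10
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "journalism-query-example/0.1"},
)
resp.raise_for_status()

for row in resp.json()["results"]["bindings"]:
    print(row["countryLabel"]["value"], row["population"]["value"])
```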

If there’s one final piece of advice I would like you to remember, it’s that all media is made by people, whether it’s on the BBC, YouTube or Wikipedia. The only way to really understand a platform like Wikipedia is to get involved in editing it. Understanding how it works should allow you to get the most out of it, and to avoid writing simplistic articles like the ones I often see on sports websites about how X football player’s Wikipedia page was HACKED by fans of an opposing team. Editing a Wikipedia page is not hacking it; it’s what you’re supposed to do. Pages get vandalised all the time, but sophisticated tools have been developed for flagging and reverting vandalism, which ensure it doesn’t tend to last for long.

You should also remember that Wikipedia is still a work in progress. That’s why the logo is a puzzle piece globe of different languages. The majority of people editing Wikipedia are still white, male, and from the global North, and this affects the content on the sites. There are lots more articles about military history, WWE wrestlers and men in general than there are about women, queer or non-European people or history. Part of being a good journalist is being aware of these biases, and taking them into account in your writing.

Thanks for reading this guide, and I hope you have found some of it useful!

3000 new articles added to the Welsh Wicipedia

Volunteers at the National Library of Wales have been translating important health articles from English into Welsh – image by Jason Evans CC BY-SA 4.0

By Jason Evans, National Wikimedian for Wales

Improving health related content on the Welsh Wicipedia

The Welsh language Wicipedia is the most viewed website in the Welsh language, and articles about health related issues are among the most frequently viewed. And yet only around 2% of Welsh articles cover this subject, compared to more than 6% in English.

Welsh speakers deserve access to quality health information in their native tongue, but currently hugely important topics such as cancer, mental health and medical treatments have very little coverage. So, in July 2017 the National Library of Wales, with Welsh Government funding and Wikimedia UK support, embarked on a nine-month project to improve this content. The project was called Wici-Iechyd (Wiki-Health).

A series of edit-a-thons and translation projects has already led to the creation of over 250 hand-written articles. Many have been translated from English articles prepared for use in other languages by the WikiMed project. Others are derived from text released on an open licence by project partners including the British Lung Foundation, WJEC and the mental health information service Meddwl.org.

The big news this month is the creation of 2,700 articles about human genes. The articles were created using information from Wikidata and PubMed, and images from Wikimedia Commons. Since all articles about genes follow a similar format, it was possible to generate and upload the 2,700 articles en masse. The articles include information about the location and structure of the genes, as well as synonyms, and all include a bibliography with the five most recent publications about each gene. Wikimedia UK were involved in producing a Wikidata Infobox which pulls in an array of data, images and citations. Naturally, time was also spent ensuring Wikidata had Welsh labels for items likely to be called on by the infobox.
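To give a flavour of how such a bulk creation can work, here is an illustrative sketch only: generating short articles from tabular data and a single wikitext template. The field names, template text and infobox name are invented, and the real Wici-Iechyd articles are of course written in Welsh, so the actual workflow will have differed in detail.

```python
# Illustrative sketch only: generate short gene articles from tabular data
# (e.g. exported from Wikidata/PubMed) and one wikitext template.
# Field names, template text and the infobox name are all invented.
import csv

TEMPLATE = """{{{{Infobox gene}}}}
'''{name}''' is a human gene located on chromosome {chromosome}.

== Bibliography ==
{references}
"""

def build_articles(csv_path):
    articles = {}
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # one bullet per recent publication, separated by ';' in the export
            refs = "\n".join(f"* {ref.strip()}" for ref in row["recent_papers"].split(";"))
            articles[row["name"]] = TEMPLATE.format(
                name=row["name"],
                chromosome=row["chromosome"],
                references=refs,
            )
    return articles  # ready to upload with a bot framework such as Pywikibot

if __name__ == "__main__":
    for title, text in build_articles("genes.csv").items():
        print(f"== {title} ==\n{text}")
```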

Members of the Royal College of Nursing improving content at an edit-a-thon in Cardiff – image by Jason Evans CC BY-SA 4.0

It is hoped that many future improvements to health-related content will link to these articles about genes, giving a greater depth of information on the subject.

This upload alone represents a 2.8% increase in the total article count for the Welsh Wicipedia; however, with more articles being prepared on diseases, drugs and medical pioneers, we could see close to a 5% increase by the end of the project. It is likely that health-related content as a percentage of the total article count will then be comparable to, or better than, the ratio in the much larger English Wikipedia.

The project is funded until the end of March, but it is hoped that Wici-Iechyd will continue to thrive as a wiki project on the Welsh Wicipedia.

US Second World War propaganda films migrated to Commons

Victor Grigas, a video producer and storyteller who has worked with the Wikimedia Foundation for a number of years, posted on the Wikimedia Video Production House Facebook group yesterday that he had migrated Frank Capra’s Second World War films from YouTube to Commons so they can be used on Wikipedia.

The Why We Fight series of films was made by Frank Capra in response to Leni Riefenstahl’s Nazi propaganda film, Triumph of the Will. Capra described Riefenstahl’s film as ‘a psychological weapon aimed at destroying the will to resist’. Capra later wrote in his 1971 autobiography,

‘I sat alone and pondered. How could I mount a counterattack against Triumph of the Will; keep alive our will to resist the master race? I was alone; no studio, no equipment, no personnel.’

Content made by the US federal government is public domain by default, and Grigas found the videos on the US National Archives YouTube channel.

Under the US Copyright Act 1976, “a work prepared by an officer or employee” of the federal government “as part of that person’s official duties” is not entitled to domestic copyright protection under U.S. law and is therefore in the public domain.

Grigas used the Video2Commons tool to migrate the files from YouTube. There is quite a lot of US government public domain video on YouTube, which you can search through Creative Commons’ search site. Although low resolution versions at 320p already existed on Commons, the transfer means there are now high quality ones available.

“I just saw the low-resolution versions on Wikipedia and thought that these films might have a better transfer out there and I was right. I saw these films in film school and they were enormously influential, I mean they copy elements of them in Star Wars. So I thought I should improve these articles”, Grigas said.

If you find any good public domain video online and add it to Commons for use on Wikipedia, why not tell us about it?

A guide to the past: hillforts and Wikimedia

Barbury Castle in Wiltshire is one of more than 4,000 prehistoric hillforts in Britain and Ireland. Photo by Geotrekker72, licensed CC BY-SA 4.0.

What is the Atlas of Hillforts?

Hillforts are enormous archaeological sites dotted around Britain and Ireland. They are some of the most impressive remains from prehistory. Just five years ago the best guess for how many there might be was ‘likely … over 4000’, but now, thanks to the efforts of the University of Oxford and the University of Edinburgh, we know there are 4,147 and have a wealth of information about them at our fingertips.

Back in 2013, archaeologists at Oxford and Edinburgh teamed up to work on the Atlas of Hillforts. Their four-year mission was to identify every single hillfort in Britain and Ireland and their key features. This had never been done before, and as Oxford’s Prof. Gary Lock said, it would allow archaeologists to “shed new light on why they were created and how they were used”.

Some hillforts, like Maiden Castle, are well known and archaeologists have examined them for decades, but these give us only a postage-stamp-size glimpse of the huge overall picture. There are thousands of hillforts in Britain and Ireland, so if you want to understand them it’s important to have foundational information such as how many there are and where they can be found, and to build on that by adding information on what type of site each one is. The more information there is, the more analysis you can do. That’s what the Atlas set out to achieve.

When the project was under development, Wikimedia UK was supporting a Wikimedian in Residence (WIR) at the British Library, Andrew Gray. He talked to the people involved in the project and suggested using Wikipedia to share the results of the project. After all, they were going to create a free-to-access online database. Perhaps the information could be used to update Wikipedia’s various lists of hillforts?

Fast forward to the summer of 2017 when the Atlas launched. At this point Wikimedia UK was supporting a WIR at the University of Oxford, Martin Poulter. His work includes helping researchers use the Wikimedia projects to increase their impact, and he worked with the Atlas of Hillforts project to share information from their database on Wikidata. Together they selected a set of information from the Atlas which Martin then uploaded to Wikidata.

Why is this project important?

It contains a huge amount of information: details of investigations at each site, a bibliography of related sources, even what kind of dating evidence there is. If you are writing about hillforts today – whether as an academic or for Wikipedia – it would be a very good idea to start by going to the Atlas of Hillforts to see what information it has on a site and what other sources of information it signposts.

For example, here is the record for Mellor hillfort in Greater Manchester. It includes any alternative names, its reference number for the Historic Environment Record (HER), a grid reference, and a summary of the site. It also gives details of nine sources you can explore for more information, and tells you when it was investigated (a geophysical survey in 1998, and excavation between 1998 and 2009). It tells you what kind of dating evidence there is, and you might notice that here it doesn’t have information on how many entrances the hillfort had or what shape they were. That’s because the site has been largely destroyed, as mentioned in the summary. That gives a Wikipedia editor a lot of information to work with.

Creating an atlas like this is a crucial way to share information: it creates a gold standard for information in the field, and because it is much easier to find information about a site, it’s easier to stay up to date, make comparisons with other sites, and spend more time analysing this information and pushing forward our understanding.

Map of hill forts in the British Isles, created in the Wikidata Query Service using data shared by the Atlas of Hillforts. Image created by Martin Poulter, licensed CC0.

Why is this useful for Wikipedia?

The information from the Atlas can be used to update lists, as initially hoped, as well as to create visualisations for Wikipedia, and it can be used by editors to update and create articles. The English Wikipedia’s pre-existing content on hillforts was seen by 5,299 people a day in June 2017. Since the information is in Wikidata, it can be used in different language Wikipedias. The appeal of Wikimedia isn’t just the reach of the projects, but the fact that Wikimedia Commons offers a database of free-to-use images. There are nearly 3,600 media files of hillforts on Commons, which complements the Atlas, which only has vertical aerial photos from Google Maps.

Most importantly, the Atlas is a very high quality resource which will benefit Wikipedia’s editors and readers. It is likely to be used again and again and to shape how people understand these prehistoric sites.

For more technical information on how the data from the Atlas was added to Wikidata, see Martin Poulter’s blog post on the Bodleian’s website from October.

Talking to Creative Commons’ Ryan Merkley about CC Search and Structured Data on Commons

Creative Commons’ Ryan Merkley and Wikimedia Foundation Executive Director Katherine Maher at Mozfest 2017 – Image by Jwslubbock CC BY-SA 4.0

CC Search beta was launched in February. This new tool incorporates ‘list-making features, and simple, one-click attribution to make it easier to credit the source of any image you discover.’ Its developer, Liza Daly, describes it as ‘a front door to the universe of openly licensed content.’

As a small organisation, Creative Commons did not have the resources to start by indexing all of the 1.1 billion openly licensed works that it estimates are available in the commons. Liza Daly decided to start with a representative sample of about 1% of the known content online, selecting about 10 million images rather than a cross-section of all media types, because the majority of CC content is images.

One issue they encountered was in making sure that all the content they would include was CC licensed, where a provider (like Flickr) hosted content that was both CC and commercially licensed. They also decided to defer the use of material from Wikimedia Commons, saying that,

‘Wikimedia Commons represents a large and rich corpus of material, but rights information is not currently well-structured. The Wikimedia Foundation recently announced that a $3 million grant from the Sloan Foundation will be applied to work on this problem, but that work has just begun.’

The Wikimedia Foundation understands that the resources available through Wikimedia Commons are not as accessible as they could be, as a result of the ad hoc nature of much of the metadata attached to the files people have uploaded. For example, one common query is ‘Why can’t I search Commons by date?’ The problem here is: which date? Is it the stated date the photo was taken (which could be incorrect) or the date the file was created, which could be different?

This is why Structured Data is so important. The $3m grant that the WMF has received to implement structured data on Commons, in a similar way to how it’s structured on Wikidata, will allow for much better searching and indexing of media files.

CC Search wants to make CC content more discoverable, regardless of where it is hosted online. To do this, they decided to import the metadata from the selected works that they are currently indexing – title, creator name, and any known tags or descriptions. This data links directly back to the original source so you can view and download the media. It seems that in its current, unstructured state, Wikimedia Commons is not very good for systematically importing this kind of metadata.

It seems that Creative Commons is even looking at the possibility of using some kind of blockchain-like ledger system to record reuse of CC licensed works so that reuse can be tracked. However, this remains a longer term goal.

I asked Creative Commons CEO Ryan Merkley some questions about how the project had been progressing since its announcement and how it might work.

WMUK: How much progress has been made on CC search since the start of 2017? Have you indexed many more than the original 10 million media items?

RM: CC has hired a Director of Product Engineering, Paola Villarreal, to lead the project. We’re staffing up the team, with a Data Engineer starting soon. In addition, we’ll be pushing a series of enhancements, including adding new content, by the end of the year.

WMUK: Will you have to wait until the end of the Structured Data on Commons project to index Wikimedia content? Or does the tool only require basic metadata categories like Title, Creator, Description and Category Tags, meaning it would be possible to start this before the end of the project?

RM: We’re happy to work with the Wikimedia Commons community on the project. In our initial conversations, we mutually decided to wait until some of that work was further along. We want to make sure our work is complementary.

WMUK: Is it still an ultimate ambition to use some kind of blockchain architecture to record reuse? Or is that potentially a goal that would require more resources than will likely be available for the foreseeable future?

RM: Not necessarily. There’s a lot of interesting work going on with the blockchain and distributed ledger projects. What’s most important to us is a complete, updated, and enhanced catalog of works and metadata that is fast and accessible.

WMUK: Can you explain how ledger entries would be created when someone reused a CC licensed work?

RM: The tools to track remix don’t exist right now. It’s something we’re really interested in, and our community wants as well. It will require new tools, and collaboration with platforms and creators.

There are so many incredible applications possible for all the data on Wikimedia Commons, and we hope that after the content is structured properly, it will become a valuable source which can be searched along with other CC content online using Creative Commons’ CC Search tool. Like a lot of the changes we would like to see in the way the Wikimedia products work, this will likely take some time, but we are hopeful that the wait will be worth it.