Wikisym 2012 Report

From Wikimedia UK


WikiSym 2012 was hosted by the Ars Electronica Centre in Linz, and coincided with the Ars Electronica Festival The Big Picture. WikiSym is one of the most important annual venues for sharing research, ideas and experience around Wiki collaborations, and was attended by academics, researchers and Wikimedia representatives from around the globe.

With the travel award generously provided by Wikimedia UK, Gavin Baily was able to attend the symposium and present research projects at both WikiSym and Ars Electronica.

The work presented at the symposium covered a broad range of themes, and was informed by the ever accumulating archives of Wikipedia and other Wiki projects. The scale and complexity of these datasets provide researchers with a rich source of metadata that is generated in the course of Wiki production. Examples of this metadata include talk pages and comments, language inter-links, citations, revision logs, user account histories and usage, and also semantic analysis of page content. These features of the data were used to ask broader questions about the sociology, content, and cultural diversity of Wiki projects, and ultimately how Wiki communities function, and how they might be developed in the future.

In the following sections I'll outline some of the themes that emerged from the symposium, focusing on research around Wikipedia and the diverse methods of analysing content. In particular I'll look at work that explores how Wikipedia is represented in different language editions, a topic that was central to the symposium keynotes, and the subject of a number of research papers. The work outlined here represents a small fraction of the contributions; for a comprehensive program see WikiSym 2012.

Wikipedia representation by language and culture

Jimmy Wales

One of the highlights of the symposium was Jimmy Wales's keynote speech, which focused on the opportunities and challenges for Wikipedia over the next 5 to 10 years. As a key aspiration Wales asked us to 'Imagine a world in which every single person on the planet is given free access to the Sum of all Human knowledge'. In terms of growth, Wales identified the developing world, and in particular Africa, as a site for increasing participation. Although increasing internet access is the main driver in this, Wales said that he had caused some controversy when he tweeted “Broadband speed in Nigeria. Beats New York City”.

As part of the effort to encourage participation in under-represented languages, Wales described the sudden growth of the Yorùbá edition in 2011, and his presentation of the second annual 'Jimbo Award' to User:Demmy for a single-handed contribution to it. User:Demmy had created a bot that added 15,000 articles in one month, doubling the edition's size and increasing the number of active editors. Drawing a parallel with other editions of Wikipedia, Wales said that the use of bots to generate a large number of articles (particularly for geographic locations) had been significant in bolstering Polish, and indeed English.

Average number of edits to Wikipedia by Country, from Mark Graham's Where do Wikipedia edits come from?

Considering some of the challenges facing Wikipedia, Wales discussed gender imbalance, the slight decline in the number of English editors, and strategies for improving the user-friendliness of Wiki user interfaces. In relation to gender imbalance Wales described the case of Kate Middleton's dress. When the article was first written it was flagged for deletion as lacking notability, despite the widespread media interest. This is in contrast to the volume of articles about Linux releases, which arguably have a smaller audience. Jodi Schneider pointed out that the dress controversy has spawned a whole category, Royal wedding dresses.

On the subject of English Wikipedia's decline, Wales speculated that one possible factor is that so many historic subjects already have comprehensive articles. He cited the case of George Wallace, 45th Governor of Alabama, a less famous politician, who already has a very comprehensive article. Looking to the future of Wikipedia's editing tools, although innovations such as the visual editor are in development, Wales was concerned that too great a conservatism within the community would impede efforts to experiment with more user-friendly interfaces, and so to encourage new contributors.


Brent Hecht

Mining and Applying Diverse Perspectives in User-Generated Content

The second keynote was given by Brent Hecht from Northwestern University, who discussed how Wikipedia reflects cultural contexts and some new tools and algorithms for examining this cultural diversity. Hecht set the scene by describing a tale of two history books. Growing up he noticed that family friends from Mexico had rather different accounts of the Mexican American War.

These discrepancies prompted questions about how Wikipedia articles compare across different languages, and also whether language editions cover the same set of concepts. Using concept alignment algorithms that indicate whether an article in one language refers to the same concept as that in another, Hecht showed that across all Wikipedias most concepts belong to a single language and that few concepts appear in a large number of languages. The result was illustrated with the example of chocolate, a common concept across many languages, but with a fraction of the articles compared with those for culturally unique chocolate products. One of the key points is that larger language editions are not supersets of smaller ones, and that the set of concepts for an encyclopedia is culturally specific. For example the French and German editions are a similar size but only share 33% of the same concepts.

Hecht next outlined how language editions are biased towards countries where the language is prominent, a phenomenon he described as the 'self-focus bias'. To measure the comparative degree of self-focus for a language, Hecht used the 'Indegree Sum'. This is calculated by summing the number of articles that refer to locations in each country. For example in the Finnish Wikipedia the articles Eurovision laulukilpailu, Linus Torvalds and Alfred Hitchcock all refer to Helsinki, which increases the Finland count. The Alfred Hitchcock article also points to London, which increases the UK count. The result is an article count for each country that can be displayed as a choropleth map, and which typically indicates self-focus.
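The counting described above can be sketched in a few lines of Python. This is only an illustration of the idea as reported here, not Hecht's implementation: the function name `indegree_sums`, the two lookup tables, and the rule that an article adds at most one to each country are all assumptions made for the example.

```python
from collections import Counter


def indegree_sums(article_locations, location_country):
    """Count, per country, the articles that refer to a location in that country.

    article_locations: {article title: [location names it links to]}
    location_country:  {location name: country}
    Both structures are hypothetical stand-ins for data mined from one edition.
    """
    counts = Counter()
    for locations in article_locations.values():
        # Assumption: an article counts once per country, however many
        # locations in that country it mentions.
        countries = {location_country[loc] for loc in locations
                     if loc in location_country}
        counts.update(countries)
    return counts


# Toy data based on the Finnish Wikipedia example in the text.
article_locations = {
    "Eurovision laulukilpailu": ["Helsinki"],
    "Linus Torvalds": ["Helsinki"],
    "Alfred Hitchcock": ["Helsinki", "London"],
}
location_country = {"Helsinki": "Finland", "London": "United Kingdom"}

sums = indegree_sums(article_locations, location_country)
print(sums)
```

The resulting per-country counts are the values that would be shaded on the choropleth map.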

Russian Wikipedia Indegree Sums by Brent Hecht - Measuring Self-Focus Bias in Community-Maintained Knowledge Repositories

In the final section of his talk Hecht presented two tools that explore how Wikipedia concepts are represented across different language editions, and how they can be spatially categorised. Omnipedia visualises how each Wikipedia concept is referenced in 32 language editions, highlighting which articles are unique and which are shared across languages. Hecht demonstrated the software with the search concept conspiracy theory. In the screenshots below the single colour dots are unique related articles, and the multicoloured disks show articles occurring in various languages. The Hebrew Wikipedia is the only one to mention a Middle Eastern conspiracy theory about Microsoft Windows, whereas The Protocols of the Elders of Zion is a more widely held conspiracy.

Omnipedia - conspiracy search

Expanding on the research around self-focus bias, the Atlasify project maps the relatedness of a concept to any one of a number of spatial reference systems. As an example Hecht showed how the concept Nuclear Power can be visualised as a choropleth map on three different reference systems: the World Map, the Periodic Table, and the U.S. Senate Seating Plan. This extraordinarily powerful search system is applicable to any Wikipedia category, and will be of great interest to a variety of Wikipedia users.

Atlasify visualizing the query concept 'Nuclear Power' on the 'World Map' reference system
Atlasify - Periodic Table
Atlasify - Senate seating plan

Paolo Massa

Manypedia: Comparing Language Points of View of Wikipedia Communities by Paolo Massa, Federico Scrinzi

Paolo Massa's presentation on Linguistic Points of View in Wikipedia dealt with the problem of the neutral and unbiased voice in the context of diverse cultural and linguistic communities. In another formulation of Brent Hecht's Tale of Two History Books, Massa asked:

'Do editors on Arabic Wikipedia and editors on Hebrew Wikipedia write the same history of the “Gaza war”?'.

Massa drew attention to Wikipedia's own policies and dialogue around Wikipedia:Neutral_point_of_view (NPOV), and the known biases that result from author demographics. The page Wikipedia:Systemic bias states that:

“The Wikipedia project suffers from systemic bias that naturally grows from its contributors' demographic groups, manifesting an imbalanced coverage of a subject, thereby discriminating against the less represented demographic groups.”
“The average Wikipedian on the English Wikipedia is a male, technically inclined, formally educated, an English speaker (native or non-native), European–descent, aged 15–49, from a majority-Christian country, from a developed nation, from the Northern Hemisphere, and likely employed as a white-collar worker or enrolled as a student rather than employed as a labourer”.

The concept of the NPOV is itself contested: Massa cited Roy Rosenzweig's characterisation of it as Wikipedia's "founding myth", a "view from nowhere". Massa went on to describe a number of Wiki encyclopedia projects that have established their own POV. is Cuba's Wikipedia from a decolonizer point of view. The atheist point of view comes in the form of, and to discover what Wikipedia and the liberal media don't want you to know about see

In the last section of his talk Massa demonstrated ManyPedia, a tool that enables cross-cultural analysis of specific Wikipedia articles. Having searched for a topic, the user can select two different languages to compare how the subject is represented. These can be automatically translated to the user's native tongue using Google Translate. To see at a glance some of the salient differences between articles, the images for each page are displayed together at the top.

Image grab of Manypedia - Manypedia is a project of Paolo Massa and Federico "fox" Scrinzi of SoNet group at FBK

Morten Warncke-Wang

In Search of the Ur-Wikipedia: Universality, Similarity, and Translation in the Wikipedia Inter-language Link Network by Morten Warncke-Wang, Anuradha Uduwage, Zhenhua Dong, John Riedl

Morten Wang, from GroupLens Research at the University of Minnesota, examined the question of the Ur-Wikipedia. Analysing the inter-language link network (ILL), Wang considered which concepts are shared between languages, how the similarity between encyclopedias can be measured, and what part translation has to play in the production of content.

Wang's first question was to find the universal concepts that nearly every Wikipedia includes. One method of finding universal concepts is to select articles that appear in more than half the 283 editions and that also have a large amount of content. Filtering by article length excludes some freak articles that have been translated by individuals across all languages, although they are culturally specific. One such example, the True Jesus Church, is the most widely translated article.
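As a rough illustration of this kind of filter, the sketch below keeps a concept only if it appears in more than half of the editions and its article is above a length threshold. The function name, the data layout, the toy figures and the `min_length` cut-off are all assumptions for the example; the report does not say what length threshold the paper used.

```python
def universal_concepts(concept_editions, concept_length, total_editions, min_length):
    """Return concepts present in more than half the editions and long enough.

    concept_editions: {concept: set of edition codes containing it}
    concept_length:   {concept: representative article length, e.g. in bytes}
    min_length is a hypothetical cut-off chosen for illustration.
    """
    half = total_editions / 2
    return sorted(
        concept
        for concept, editions in concept_editions.items()
        if len(editions) > half and concept_length.get(concept, 0) >= min_length
    )


# Toy data with four editions; lengths are invented for the example.
concept_editions = {
    "Earth": {"en", "de", "fr", "hi"},
    "True Jesus Church": {"en", "de", "fr", "hi"},  # widely translated but short
    "Linz": {"en", "de"},                           # long but in too few editions
}
concept_length = {"Earth": 50000, "True Jesus Church": 300, "Linz": 20000}

selected = universal_concepts(concept_editions, concept_length,
                              total_editions=4, min_length=1000)
print(selected)
```

The length test is what removes widely translated but very short articles of the kind mentioned above.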

Countries, cities, continents, celestial bodies and, above all, significant dates are cross-cultural topics. In addition to universality, Wang discussed variation in article uniqueness across editions. The number of unique articles varies greatly. In October 2011 the Waray-Waray Wikipedia had only 0.078%, and Hindi Wikipedia 70.29% unique articles. An analysis of the general categories under which these articles fall shows that they tend to be about people, places, organisations, historic events, and cultural artefacts. The interesting question is why there is such a massive difference between Wikipedias.

Table of Wikipedia language translations - Morten Warncke-Wang

Wang's second question looked at how the ILL network can be used to measure the similarities and differences between Wikipedias. Comparing and refining various statistical techniques, Wang showed that the dominant factor in measuring edition similarity is their raw number of articles. This is counter-intuitive since one would expect geographic or cultural factors to be more important, although relatedness of language was also shown to be a factor.

Finally Wang asked how much of the information in a Wikipedia comes from translations from other languages. The MediaWiki template for indicating translations is an indicative source of data, although incomplete because the template will not always have been used correctly. This shows that for the 10 largest Wikipedias English is by far the most common source of translations, often representing over 75% of the total. On the basis of these results Wang discussed models for facilitating content distribution, and the pros and cons of pushing content to English as a source for translation to other editions.

Andreas Kaltenbrunner

Biographical Social Networks on Wikipedia: A cross-cultural study of links that made history by Pablo Aragon, Andreas Kaltenbrunner, David Laniado, Yana Volkovich

Andreas Kaltenbrunner from the Barcelona Media Foundation explored how Wikipedia's biographical articles are related, the relative centrality of historical figures in different languages, and how the network of links between biographies compares across language editions. The basis of the research is measures of the links between biographical pages, including links into each biography from others, the links out, and analysis of the relative centrality of biographies in the complete network.

Looking at the biographies for each language ranked by betweenness there are some surprising results, particularly for the Russian Wikipedia in which Kenneth Branagh and Elton John are more central than Stalin, Lenin or Gorbachev. (Betweenness measures the fraction of shortest paths between other pairs of biography nodes passing through a given node).
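To make the measure concrete, here is a minimal sketch of betweenness on a small graph, using Brandes' algorithm for unweighted, undirected graphs. This is a standard textbook method swapped in for illustration; the report does not say how the authors computed it, and the real biography network is directed and far larger. The node labels are abstract placeholders, and the scores are left unnormalised (a sum, over pairs of other nodes, of the fraction of their shortest paths that pass through the node).

```python
from collections import deque


def betweenness(graph):
    """Unnormalised betweenness for an undirected graph {node: set(neighbours)}."""
    score = {v: 0.0 for v in graph}
    for s in graph:
        dist = {s: 0}                      # BFS distance from s
        sigma = {v: 0 for v in graph}      # number of shortest paths from s
        sigma[s] = 1
        preds = {v: [] for v in graph}     # predecessors on shortest paths
        order = []                         # nodes in order of discovery
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest nodes back towards s.
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: b / 2 for v, b in score.items()}


# Toy graphs: a three-node chain and a hub with three spokes.
chain = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
star = {"hub": {"x", "y", "z"}, "x": {"hub"}, "y": {"hub"}, "z": {"hub"}}

chain_scores = betweenness(chain)
star_scores = betweenness(star)
print(chain_scores, star_scores)
```

A node that bridges otherwise separate clusters scores highly even if it has few links of its own, which is one way a figure can rank above better-known ones.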

Table of biographies for each language ranked by 'betweenness' - Andreas Kaltenbrunner

Comparing the biographical networks for all languages, Kaltenbrunner presented one of the symposium's great visualisations. Taking only the links between biographies that exist in 13 of the 15 languages studied, the following network of 1663 biographies shows the commonality between figures that Wikipedia regards as significant.

visualisation detail - network of 1663 biographies shows the commonality between figures that Wikipedia regards as significant - Andreas Kaltenbrunner

Characterising this network, Kaltenbrunner says:

"The largest connected component in [the figure] corresponds to a cluster of US Presidents which connects over Ronald Reagan to a cluster of British Premier Ministers. This group is related through Winston Churchill to a cluster of persons from WW2’s axis powers. The second largest component is compound of several clusters related to the music and entertainment business, and the third one of two clusters of male and female tennis players connected through Dinara and Marat Safin. Other large isolated clusters can be found around such diverse groups as Russian and Chinese political figures, French presidents, Israeli and Palestinian politicians, Formula One pilots, World Chess Champions or actresses."

Mapping Wikipedia

As part of the demo and poster session, Gavin Baily and Bernie Hogan presented their project Mapping Wikipedia. This interactive map of Wikipedia's geo-located articles allows users to search by language and geographic region. The map highlights cultural biases in geographic coverage, and nuances in the self-focus phenomenon. In its current form the project focuses on languages used in the Middle East, indicating vast inequalities in representation between English, French, Arabic, Hebrew, Persian and Swahili Wikipedias. In addition to plotting the article locations and content, the map displays various article metrics that delineate regions of article stubs, topographies of image content, and highly edited locations.

The most recent feature of the application is a timeline that filters articles by creation date. Adjusting the date range shows how different language editions have evolved over time, and highlights the use of bots to automatically populate geographic regions.

Sustaining Wikipedia communities

Many of the symposium papers addressed patterns of participation in Wikipedia communities, asking questions about who Wikipedians are, barriers to becoming a Wikipedian, and how communities can be sustained. There is a widely cited statistic that since 2007 the number of Wikipedia authors has plateaued. This phenomenon has prompted various research projects both from academics and the Wikimedia Foundation. Ryan Faulkner and Maryana Pinchuk from the Wikimedia Foundation presented Etiquette in Wikipedia: Weening New Editors into Productive Ones, which looked at the experience of new users and experiments into making Wikipedia warning messages more user-friendly.

Heather Ford and R. Stuart Geiger considered how Wikipedians can become more effective contributors in Writing up rather than writing down: Becoming Wikipedia Literate, and Jodi Schneider discussed factors contributing to article deletion, and the impact of deletion discussions, in Deletion Discussions in Wikipedia: Decision Factors and Outcomes. Kate Middleton's wedding dress is a case in point.

Carlos Castillo's presentation on Drawing a Data-Driven Portrait of Wikipedia Editors characterised contributors' browsing and research habits. A key finding was that they are more likely to look at programming sites and less likely to look at adult sites than other internet users.

In relation to the slight decline in active contributors, Dell Zhang asked How Long Do Wikipedia Editors Keep Active?, comparing survival functions to estimate the longevity of active users.
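For readers unfamiliar with survival functions, the sketch below estimates one with the Kaplan-Meier method, a common choice when some editors are still active at the end of the observation period. It is an assumption that this estimator resembles what the paper used; the function name and the toy figures are invented for illustration.

```python
def kaplan_meier(durations, observed):
    """Estimate S(t), the probability that an editor is still active after time t.

    durations: how long each editor was followed (e.g. months since first edit)
    observed:  True if the editor was seen to become inactive at that time,
               False if they were still active when observation stopped
    Returns a list of (t, S(t)) at each time where someone became inactive.
    """
    pairs = sorted(zip(durations, observed))
    at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        dropped = 0   # editors who became inactive at time t
        leaving = 0   # everyone whose record ends at time t
        while i < len(pairs) and pairs[i][0] == t:
            dropped += 1 if pairs[i][1] else 0
            leaving += 1
            i += 1
        if dropped:
            survival *= 1 - dropped / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Toy data: five editors, two of them still active when observation ended.
durations = [1, 2, 2, 3, 4]
observed = [True, True, False, True, False]

curve = kaplan_meier(durations, observed)
print(curve)
```

Comparing such curves for different cohorts of editors is one way to ask whether newer contributors stay active for a shorter time than earlier ones.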

Wikipedia's response to world events

A number of papers analysed how news stories and current affairs are reflected in Wikipedia articles. Brian Keegan compared the article revision graphs for breaking and non-breaking news stories. Visualising the revision graphs of articles concerning airline disasters shows that there is greater interaction between users for breaking news stories (Staying in the Loop: Structure and Dynamics of Wikipedia’s Breaking News Collaborations).

Visualising the revision graphs of articles concerning airline disasters

Michela Ferron, in her paper Psychological processes underlying Wikipedia representations of natural and manmade disasters, used text analysis of emotional terms in article content to analyse differences in author response to particular types of disaster.

The relationship between user comments, edits and current affairs featured in Andreas Kaltenbrunner's paper There is No Deadline - Time Evolution of Wikipedia. One example, charting the article Barack Obama, shows the relative user attention to specific events in his presidency.

Peaks in the edit and discussion activity of the article 'Barack Obama' - Kaltenbrunner

Ars Electronica : The Big Picture

The first Symposium session, "Overview Effect", was organised by SEED Magazine, who develop visualisation projects and promote work in the field through their blog. Introducing the session, Adam Bly considered a range of global issues including the environment, trade and human rights, and asked whether holistic visualisations of these systems would give us a more inter-connected vision. Each of the speakers, Johan Bollen, Manuel Lima, Paola Antonelli, and Golan Levin, referenced the significance of Wikipedia and gave examples of Wikipedia visualisations. Manuel Lima charted a history of drawings and diagrams representing taxonomies of knowledge, ending with taxonomies of Wikipedia.

SEED also organised an exhibition at the Brucknerhaus in which the Mapping Wikipedia project was included among other visualisations relating to the Big Picture theme. A gallery of the exhibits can be seen at Ars Electronica: The Big Picture.

[philosophers image]