Talk:Technology Committee/Project requests/WikiRate - rating Wikimedia

From Wikimedia UK

SMART

I think this is now being used so much as a jargon word that the meaning was lost a while back. Metrics are not SMART, could this please be corrected as this just promulgates further misunderstanding? -- (talk) 20:11, 12 April 2014 (BST)

'SMART metrics' -> 'SMART targets'. Thanks for pointing that out. --MichaelMaggs (talk) 20:57, 12 April 2014 (BST)

Thoughts on task breakdown

Thanks Michael, this looks like a great start. I'm just having a bit of a brain dump here. There are some challenges on which input and/or experimentation will be required. These include:

  • Defining the classes/outcomes
  • Defining the relevant input feature set (and their detection) which will be used as predictors
  • Segmentation - both with regard to the level at which outcomes are assigned (e.g. edit, article) and with respect to input units

Some of this matters more for machine learning techniques, while we might also use some simple measures (e.g. looking at the T1-T2 difference on an article-specific Wiki ToDo list) which depend on features as proxies for outcomes. Re: outcomes, one simple thing for machine learning, particularly as a first step, might just be to classify whether the edit was 1) positive, 2) negative, or 3) neutral with respect to improving the article. We could then provide a breakdown of such edits. We should also be cautious that our feature selection doesn't exclude some widely missed but important features (e.g. alt-text). The rubric below might be a good way to 'present back' and assess improvements (probably with some aggregation method for overall improvement).

Assessment area | Scoring method | Score
Comprehensiveness | Score based on how fully the article covers significant aspects of the topic. | 1-10
Sourcing | Score based on adequacy of inline citations and quality of sources relative to what is available. | 0-6
Neutrality | Score based on adherence to the Neutral Point of View policy. Scores decline rapidly with any problems with neutrality. | 0-3
Readability | Score based on how readable and well-written the article is. | 0-3
Formatting | Score based on quality of the article's layout and basic adherence to the Wikipedia Manual of Style. | 0-2
Illustrations | Score based on how adequately the article is illustrated, within the constraints of acceptable copyright status. | 0-2
Total | | 1-26
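
As a rough illustration of how the rubric could feed the positive/negative/neutral breakdown, here is a minimal sketch. It assumes a plain sum as the aggregation method and assumes that an edit is classified by comparing rubric totals before (T1) and after (T2); neither choice has been agreed in this discussion, and the names `RUBRIC_RANGES`, `total_score` and `classify_edit` are hypothetical.

```python
# Score ranges per assessment area, copied from the rubric above.
RUBRIC_RANGES = {
    "comprehensiveness": (1, 10),
    "sourcing": (0, 6),
    "neutrality": (0, 3),
    "readability": (0, 3),
    "formatting": (0, 2),
    "illustrations": (0, 2),
}


def total_score(scores):
    """Validate per-area scores and return their sum (assumed aggregation)."""
    if set(scores) != set(RUBRIC_RANGES):
        raise ValueError("scores must cover exactly the rubric's assessment areas")
    for area, value in scores.items():
        low, high = RUBRIC_RANGES[area]
        if not low <= value <= high:
            raise ValueError(f"{area} score {value} is outside {low}-{high}")
    return sum(scores.values())


def classify_edit(before, after):
    """Return 1 (positive), 0 (neutral) or -1 (negative) for a T1 -> T2 change."""
    delta = total_score(after) - total_score(before)
    return (delta > 0) - (delta < 0)


# Example: an edit that only improves sourcing.
t1 = {"comprehensiveness": 4, "sourcing": 2, "neutrality": 3,
      "readability": 2, "formatting": 1, "illustrations": 0}
t2 = dict(t1, sourcing=4)
print(total_score(t1), total_score(t2), classify_edit(t1, t2))
```

A weighted sum, or a per-area breakdown instead of a single total, would slot into `total_score` without changing the classification step.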


It's also the case that the salient features may vary with a combination of temporal and editor-interaction factors. Early-stage articles benefit greatly from the addition of different features than later-stage ones (e.g. amongst others, [5]).

There's also an interesting point re: namespace contributions on talk and article pages, presumably in the first instance we're looking only at article contributions.

It is also worth noting that whatever we do, we should where possible consider implications for non-English Wikipedias, in particular the ways in which references are used are (I believe) different in different Wikipedias. This may well also be true of different projects.

Finally, we should also note that if we did anything successfully, a number of benefits might also be gained, including automation (or semi-automation) of quality assessment within projects, etc., and potential for new editor engagement experiments, e.g. sending editors to articles which we think might be easily improved (a more sophisticated 'wiki to do' tool). Sjgknight (talk) 10:34, 13 April 2014 (BST)

Just crossed my mind it's also worth noting the potential benefits for e.g. education being able to take the contributions of a particular editor, and look for particular features (e.g. use of 'cite' templates) across the contributions. That would have benefits outside of Wikipedia (on Mediawiki) where analytics on writing style and content could be conducted. Sjgknight (talk) 10:45, 13 April 2014 (BST)
Regarding assessment areas, take a look at Table 3 of Stvilia, Besiki, et al. "Information quality work organization in Wikipedia." Journal of the American Society for Information Science and Technology 59.6 (2008): 983-1001. It gives a good list of criteria. For example, "Accessibility" has "caused by (1) Language barrier (2) Poor organization (3) Policy restrictions imposed by copyrights, Wikipedia internal policies, and automation scripts" and suggests actions such as "Reorganize, duplicate, remove, translate, split, join, rearrange". See the whole list, I think it might be helpful for your chart, Simon. Jodi.a.schneider (talk) 10:48, 13 April 2014 (BST)
How would we want to deal with a contribution which is ostensibly 'valuable' (e.g. a good reference is added, a paragraph of well written, sourced, and intra-linked text is added) but which makes little impact on the overall quality of the article (e.g. a good para is added to a generally bad article)? Is this a good case in point for feedback on the contribution level, can looking at 'net improvement' provide useful info, etc. Sjgknight (talk) 16:42, 6 May 2014 (BST)

Further thoughts

Just had a Skype with a colleague about some of this, mostly agreement on the comments above, some very quick points:

  1. What is being assessed? Articles, contributors, or contributions - my response to this is that we're interested in contributions although that might need some assessment of articles. Contributors are dealt with in other ways, and an assessment of them (although certainly important and interesting) doesn't deal with the core problem here - understanding whether the content added is of high quality
  2. Some of the things above are relatively easy to assess with current methods, e.g. readability, formatting and probably NPOV (as a 'genre'). Comprehensiveness would require some understanding of the field (possible, e.g. Watson, but a different problem), as would sourcing + some understanding of those sources.
  3. The things above miss some features we probably care about (e.g. adding references and content), but for a first pass that might be ok

It's also just crossed my mind that one idea for dealing with contribution v. article improvement might be to use some sort of distribution measure. I'm thinking of F measures, which can weight precision/recall scores in information-retrieval (etc.) tasks. The analogue here would be that we might be interested not only in one excellent para being added (precision) but also in overall improvements (recall). The score could perhaps be +ve/-ve depending on the quality of the contribution, which gives more than just a T1-T2 article comparison to see whether a contribution improved, made no difference to, or damaged the article (1, 0, -1). Sjgknight (talk) 17:58, 22 May 2014 (BST)
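
To make the F-measure idea concrete, here is a minimal sketch using the standard F-beta formula, F = (1 + beta^2) * P * R / (beta^2 * P + R). Mapping "precision" to the quality of the contribution itself and "recall" to the overall article improvement, both scaled to 0-1, and multiplying by a direction of 1, 0 or -1, are assumptions made only for illustration; `f_beta` and `signed_contribution_score` are hypothetical names.

```python
def f_beta(precision, recall, beta=1.0):
    """Standard F-beta score; beta > 1 weights recall more, beta < 1 precision."""
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0  # both inputs are zero
    return (1 + beta ** 2) * precision * recall / denominator


def signed_contribution_score(contribution_quality, article_gain, direction, beta=1.0):
    """Combine the two 0-1 scores and sign the result.

    direction is 1 (improved), 0 (no difference) or -1 (damaged); this signing
    scheme is an assumption, not something settled in the discussion.
    """
    if direction not in (-1, 0, 1):
        raise ValueError("direction must be -1, 0 or 1")
    return direction * f_beta(contribution_quality, article_gain, beta)


# One excellent paragraph (quality 1.0) that only modestly lifts the article (0.5).
print(round(signed_contribution_score(1.0, 0.5, 1), 4))
```

Because F-beta behaves like a harmonic mean, an excellent paragraph in an article that barely improves overall still scores low, which is the behaviour the contribution v. article question is after.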

Related wikimania talks and materials

WikiTrust

WikiTrust is looking for an adopter, have you considered hosting/supporting it? --Nemo bis (talk) 20:11, 16 April 2014 (BST)

I wasn't aware of that. Could certainly be discussed. Do you know who would be best person to contact? --MichaelMaggs (talk) 16:36, 21 April 2014 (BST)
This reminded me I'd seen that discussion, this is the only thread I could find, no resolution as far as that indicates. Sjgknight (talk) 16:45, 21 April 2014 (BST)

User-based quality measures

Thanks for this work Michael! I think one of the most challenging aspects of quality is tying contributions to specific users, i.e. how to tie various programs or events to specific user contributions. Wikimetrics measures contributions, but is unable to measure whether contributions are of high "quality". During one of the discussions at the Wikimedia Conference, which will soon be posted on-wiki, there was a discussion about various methods for measuring quality. One of the themes brought up that might be interesting to pursue is to break up the measure of "quality" into different ideas, such as "popularity" or "appreciation"; but more discussion is definitely needed around that and how to measure it. One of the main challenges with quality measures as well is that they vary significantly across various language projects. The measure of citations can vary across articles because other cultures have different customs around the idea of citations.

Another factor to consider, for example, is the benefit of improving the quality of an already pretty decent article versus adding new content to a stub. But this brings up the point that it could be risky to get too nitpicky at the outset of measuring quality, and that perhaps focusing on very general measures (such as number of headings, page views, etc.) might be more helpful and generalizable to all the projects. Thank you for starting the discussion around this topic! Looking forward to hearing your thoughts. I will also send you the link to the discussion around program outcomes when it's posted. Regards - EGalvez (WMF) (talk) 21:59, 16 April 2014 (BST)

Comment by Charles Matthews (copied from wikimediauk-l)

There's the old DREWS acronym from How Wikipedia Works, to which I'd now add T for traffic. In other words there are six factors that an experienced human would use to analyse quality, looking in particular for warning signs.

  • D = Discussion: crunch the talk page (20 archives = controversial, while no comments indicates possible neglect)
  • R = WikiProject rating, FWIW, if there is one.
  • E = Edit history. A single editor, or essentially only one editor with tweaking, is a warning sign. (Though not if it is me, obviously)
  • W = Writing. This would take some sort of text analysis. Work to do here. Includes detection of non-standard format, which would suggest neglect by experienced editors.
  • S = Sources. Count footnotes and so on.
  • T = Traffic. Pages at 100 hits per month are not getting many eyeballs. Warning sign. Very high traffic is another issue.

Seems to me that there is enough to bite on, here.
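
As a rough sketch of how these warning signs might be checked mechanically: the 20-archive and 100-hits-per-month figures come from the list above, while the other thresholds (a 90% share of edits by the top editor, zero footnotes) and all names are assumptions for illustration. W (writing) is left out because it needs text analysis, and very high traffic is not flagged here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ArticleSignals:
    """Hypothetical per-article inputs; gathering them is out of scope here."""
    talk_archives: int
    talk_comments: int
    project_rating: Optional[str]  # e.g. "B", or None if unrated
    distinct_editors: int
    top_editor_share: float        # fraction of edits by the most active editor
    footnotes: int
    monthly_views: int


def warning_signs(s: ArticleSignals) -> List[str]:
    """Return DREWS+T warning flags for one article."""
    flags = []
    if s.talk_archives >= 20:
        flags.append("D:controversial")
    elif s.talk_comments == 0:
        flags.append("D:neglect")
    if s.project_rating is None:
        flags.append("R:unrated")
    if s.distinct_editors <= 1 or s.top_editor_share >= 0.9:  # 0.9 is assumed
        flags.append("E:single-editor")
    if s.footnotes == 0:                                      # assumed threshold
        flags.append("S:no-footnotes")
    if s.monthly_views <= 100:
        flags.append("T:low-traffic")
    return flags


neglected = ArticleSignals(0, 0, None, 1, 1.0, 0, 80)
print(warning_signs(neglected))
```

Returning named flags rather than a single number keeps the output readable for a human reviewer, which fits the "warning signs" framing.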


Comments from Tom

So following on from my mailing list comment! I don't believe there is extensive technical work needed - it should be kept lightweight in terms of providing a way to pull a specific article from Wikipedia and apply definable metrics to it. To my mind the significant work is in evolving relevant metrics - perhaps for different use cases. So any tool should be carefully designed to be "pluggable" in the metrics you can apply. Some more thought, and development work, might be needed to support individual metric plugins - for example if we needed to create a corpus of data from many articles.

So to my mind what we really need is a framework which places a simple UI with good UX in front of a tool to pull an article from Wikipedia, and apply selected Metrics. These Metrics would be developed based on a documented plugin API so that e.g. *any* volunteer, stakeholder, or whomever, could build a Metric. Internally there should be a full test suite (good practice), Metric integration suite (determine user submitted Metric plugins were valid), and Metric submission process. Ideally the tool should be simple enough to install and use in a personal environment for a competent developer - for Metric development and custom installs.
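
To give a feel for what a "pluggable" Metric could look like, here is a minimal sketch. It is not the concept work mentioned in this comment; the `Metric` base class, the `register` decorator, `apply_metrics` and both example metrics are hypothetical, and the step that pulls an article from Wikipedia is deliberately omitted, so the wikitext is passed in directly.

```python
import re
from abc import ABC, abstractmethod
from typing import Dict, Iterable


class Metric(ABC):
    """Assumed plugin interface: a name plus a score over raw wikitext."""
    name: str

    @abstractmethod
    def score(self, wikitext: str) -> float:
        ...


REGISTRY: Dict[str, Metric] = {}


def register(cls):
    """Class decorator that adds one instance of a Metric plugin to the registry."""
    REGISTRY[cls.name] = cls()
    return cls


@register
class HeadingCount(Metric):
    name = "heading_count"
    _pattern = re.compile(r"^(={2,6})\s*.+?\s*\1\s*$", re.MULTILINE)

    def score(self, wikitext: str) -> float:
        # Counts lines such as "== History ==" or "=== Early years ===".
        return float(len(self._pattern.findall(wikitext)))


@register
class CiteTemplateCount(Metric):
    name = "cite_template_count"
    _pattern = re.compile(r"\{\{\s*cite\s+\w+", re.IGNORECASE)

    def score(self, wikitext: str) -> float:
        # Counts templates such as {{cite web ...}} or {{Cite book ...}}.
        return float(len(self._pattern.findall(wikitext)))


def apply_metrics(wikitext: str, names: Iterable[str]) -> Dict[str, float]:
    """Apply the selected registered metrics to one article's wikitext."""
    return {name: REGISTRY[name].score(wikitext) for name in names}


sample = "Intro\n== History ==\nText {{cite web|url=x}}\n=== Early ===\n{{Cite book|title=y}}\n"
print(apply_metrics(sample, ["heading_count", "cite_template_count"]))
```

A Metric integration suite could then be as simple as running every registered plugin against a few fixed sample articles and checking that each returns a number.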

We have all the bits in place to build this:

To a certain extent, this framework is entirely independent of actually deciding on Metrics to use.. and so could be specced and built in parallel with that discussion. I've done a bit of work on the concepts for this and would be happy to share them. It's quite an exciting project. I think it would probably not require too much in terms of development - probably 2 months (not full-time, obviously) from start to usable-prototype.

I've also got some comments related to metrics but will post them separately to try and keep discussion lean. --ErrantX (talk) 12:43, 23 April 2014 (BST)

Relevant literature

  1. Adler, B., de Alfaro, L., & Pye, I. (2010). Detecting wikipedia vandalism using wikitrust. Notebook Papers of CLEF, 1, 22–23.
  2. Adler, B. T., De Alfaro, L., Mola-Velasco, S. M., Rosso, P., & West, A. G. (2011). Wikipedia vandalism detection: Combining natural language, metadata, and reputation features. In Computational Linguistics and Intelligent Text Processing (pp. 277–288). Springer.
  3. Anderka, M., Stein, B., & Lipka, N. (2011). Towards automatic quality assurance in Wikipedia. In Proceedings of the 20th international conference companion on World wide web (pp. 5–6). ACM.
  4. Blumenstock, J. E. (2008a). Automatically assessing the quality of Wikipedia articles. School of Information.
  5. Blumenstock, J. E. (2008b). Size matters: word count as a measure of quality on wikipedia. In Proceedings of the 17th international conference on World Wide Web (pp. 1095–1096). ACM.
  6. Chevalier, F., Huot, S., & Fekete, J.-D. (2010). WikipediaViz: Conveying article quality for casual Wikipedia readers. In Pacific Visualization Symposium (PacificVis), 2010 IEEE (pp. 49–56). IEEE.
  7. Chin, S.-C., Street, W. N., Srinivasan, P., & Eichmann, D. (2010). Detecting Wikipedia vandalism with active learning and statistical language models. In Proceedings of the 4th workshop on Information credibility (pp. 3–10). ACM.
  8. Daxenberger, J., & Gurevych, I. (n.d.). Automatically Classifying Edit Categories in Wikipedia Revisions.
  9. De La Calzada, G. (2010). A Strategy Oriented, Machine Learning Approach to Automatic Quality Assessment of Wikipedia Articles. California Polytechnic State University.
  10. De la Calzada, G., & Dekhtyar, A. (2010). On measuring the quality of Wikipedia articles. In Proceedings of the 4th workshop on Information credibility (pp. 11–18). ACM.
  11. Druck, G., Miklau, G., & McCallum, A. (2008). Learning to predict the quality of contributions to wikipedia.
  12. Ferschke, O., Daxenberger, J., & Gurevych, I. (2013). A Survey of NLP Methods and Resources for Analyzing the Collaborative Writing Process in Wikipedia. In The People’s Web Meets NLP (pp. 121–160). Springer.
  13. Ferschke, O., Gurevych, I., & Rittberger, M. (2012). FlawFinder: A Modular System for Predicting Quality Flaws in Wikipedia. In CLEF (Online Working Notes/Labs/Workshop).
  14. Ferschke, O., Zesch, T., & Gurevych, I. (2011). Wikipedia revision toolkit: efficiently accessing Wikipedia’s edit history. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Systems Demonstrations (pp. 97–102). Association for Computational Linguistics.
  15. Han, J., Wang, C., Fu, X., & Chen, K. (2011). Probabilistic quality assessment of articles based on learning editing patterns. In Computer Science and Service System (CSSS), 2011 International Conference on (pp. 564–570). IEEE.
  16. Harpalani, M. (2010). Wiki Vandalysis. Wikipedia Vandalism Analysis. The Graduate School, Stony Brook University: Stony Brook, NY.
  17. Harpalani, M., Hart, M., Singh, S., Johnson, R., & Choi, Y. (2011). Language of vandalism: Improving Wikipedia vandalism detection via stylometric analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers-Volume 2 (pp. 83–88). Association for Computational Linguistics.
  18. Hasan Dalip, D., André Gonçalves, M., Cristo, M., & Calado, P. (2009). Automatic quality assessment of content created collaboratively by web communities: a case study of wikipedia. In Proceedings of the 9th ACM/IEEE-CS joint conference on Digital libraries (pp. 295–304). ACM.
  19. Himoro, M. Y., Hanada, R., Cristo, M., & Pimentel, M. da G. C. (2013). An investigation of the relationship between the amount of extra-textual data and the quality of Wikipedia articles. In Proceedings of the 19th Brazilian symposium on Multimedia and the web (pp. 333–336). ACM.
  20. Hu, M., Lim, E.-P., & Krishnan, R. (2009). Predicting Outcome for Collaborative Featured Article Nomination in Wikipedia. In ICWSM.
  21. Javanmardi, S., & Lopes, C. (2010). Statistical measure of quality in Wikipedia. In Proceedings of the First Workshop on Social Media Analytics (pp. 132–138). ACM.
  22. Javanmardi, S., McDonald, D. W., & Lopes, C. V. (2011). Vandalism detection in Wikipedia: a high-performing, feature-rich model and its reduction through Lasso. In Proceedings of the 7th International Symposium on Wikis and Open Collaboration (pp. 82–90). ACM.
  23. Lipka, N., & Stein, B. (2010). Identifying featured articles in Wikipedia: writing style matters. In Proceedings of the 19th international conference on World wide web (pp. 1147–1148). ACM.
  24. Mola-Velasco, S. M. (2011). Wikipedia vandalism detection. In Proceedings of the 20th international conference companion on World wide web (pp. 391–396). ACM.
  25. Moskaliuk, J., Rath, A., Devaurs, D., Weber, N., Lindstaedt, S., Kimmerle, J., & Cress, U. (2011). Automatic detection of accommodation steps as an indicator of knowledge maturing. Interacting with Computers, 23(3), 247–255.
  26. Neis, P., Goetz, M., & Zipf, A. (2012). Towards automatic vandalism detection in OpenStreetMap. ISPRS International Journal of Geo-Information, 1(3), 315–332.
  27. Okoli, C., Mehdi, M., Mesgari, M., Nielsen, F., & Lanamäki, A. (2012). The people’s encyclopedia under the gaze of the sages: A systematic review of scholarly research on Wikipedia. Available at SSRN.
  28. Rassbach, L., Pincock, T., & Mingus, B. (2007). Exploring the feasibility of automatically rating online article quality. In Proceedings of the 9th Joint Conference on Digital Libraries.
  29. Saengthongpattana, K., & Soonthornphisaj, N. (2014). Assessing the Quality of Thai Wikipedia Articles Using Concept and Statistical Features. In New Perspectives in Information Systems and Technologies, Volume 1 (pp. 513–523). Springer.
  30. Suzuki, Y. (2013). Effects of Implicit Positive Ratings for Quality Assessment of Wikipedia Articles. Journal of Information Processing, 21(2), 342–348.
  31. Tzekou, P., Stamou, S., Kirtsis, N., & Zotos, N. (2011). Quality Assessment of Wikipedia External Links. In WEBIST (pp. 248–254).
  32. Wang, S., & Iwaihara, M. (2011). Quality evaluation of wikipedia articles through edit history and editor groups. In Web Technologies and Applications (pp. 188–199). Springer.
  33. Warncke-Wang, M., Cosley, D., & Riedl, J. (2013). Tell me more: an actionable quality model for Wikipedia. In Proceedings of the 9th International Symposium on Open Collaboration (p. 8). ACM.
  34. Wöhner, T., & Peters, R. (2009). Assessing the quality of Wikipedia articles with lifecycle based metrics. In Proceedings of the 5th International Symposium on Wikis and Open Collaboration (p. 16). ACM.
  35. Wu, G., Harrigan, M., & Cunningham, P. (2012). Classifying Wikipedia articles using network motif counts and ratios. In Proceedings of the Eighth Annual International Symposium on Wikis and Open Collaboration (p. 12). ACM.
  36. Wu, K., Zhu, Q., Zhao, Y., & Zheng, H. (2010). Mining the Factors Affecting the Quality of Wikipedia Articles. In Information Science and Management Engineering (ISME), 2010 International Conference of (Vol. 1, pp. 343–346). IEEE.
  37. Wu, Q., Irani, D., Pu, C., & Ramaswamy, L. (2010). Elusive vandalism detection in wikipedia: a text stability-based approach. In Proceedings of the 19th ACM international conference on Information and knowledge management (pp. 1797–1800). ACM.
  38. Xiao, K., Li, B., He, P., & Yang, X. (2013). Detection of Article Qualities in the Chinese Wikipedia Based on C4.5 Decision Tree. In Knowledge Science, Engineering and Management (pp. 444–452). Springer.

Sjgknight (talk) 09:37, 4 May 2014 (BST)

This is a long list, suitable for a post-grad researcher, but not most volunteers. If there were two or three relevant papers from this list of 38 that a bot creator like myself might find useful to read through, if I wanted to create a quality reporting bot, which would they be? -- (talk) 13:02, 4 May 2014 (BST)
Agreed. Quite possibly there are some key ones here; if I have time to work through them I'll try to write summaries/flag them. I didn't want people to go off looking for literature when I've probably already done a fairly thorough job on that bit though. There are a couple flagged on the main page (rather than this talk page), along with some important considerations re: scope, etc., which might be a place to start. Sjgknight (talk) 13:06, 4 May 2014 (BST)

A recent addition:

39. Jingyu Han, J & Chen, K (2014). Ranking Wikipedia article's data quality by learning dimension distributions

--MichaelMaggs (talk) 08:30, 11 August 2014 (BST)

A relevant tool: LanguageTool, whose "missing features" list includes ones to identify (a) Wikipedia editorial conventions and (b) grammar/spelling changes, including markup changes.