Macrogrants/Wikimedia Commons Geograph and Avionics batch upload projects support

From Wikimedia UK
Example Geograph project image of the Eden Project, 2006. 3 categories have been added by Faebot, Cornwall being a test region for County identification.
Objective

This application is for supporting funding for two of the largest non-GLAM projects on Wikimedia Commons. These projects are entirely driven by unpaid volunteers and have a track record of delivering huge amounts of valued content for Wikipedia in many languages.

Goals
  • Deliver 100,000 uploaded and categorized quality amateur Avionics photographs from a selection of forums, with releases from the photographers on record. Project page: Commons:Commons:Batch uploading/Airliners.
  • Categorize the UK Geograph images geographically (currently just under 2 million on Commons), refresh the collection with those now available in higher resolutions, and add further photographs (a total current collection of 3.7 million photographs). Project page: Commons:User:Faebot/Geograph.
Resources
  • Volunteer resources. These are long running projects spanning years rather than months and requiring regular maintenance tasks when complete. Being highly reliant on volunteer time, the plan needs to be flexible, but firm announcements about the projects would be available for Wikimania 2014 and the main deliverables would complete by the end of 2014. The principal contributors in 2013 for the avionics project have been Russavia and Fæ with wide support from more than 150 other contributors. Time from principal contributors has been of the order of more than 10 hours per week. The Geograph project took a lot of development and test time in 2012 but is now less demanding; regular set-up and maintenance is of the order of 10 hours per month of Fæ's time.
  • Communications and hardware. The bandwidth costs have been high. A key reliability issue has been Fæ's internet connection and an inability to do any batch image processing apart from simple cropping (using a part time old notebook as a Windows installation). Video processing requests (such as conversions to OGV) have been rejected in the past due to this lack of power rather than lack of volunteer time.
    • The primary machine for this work is a maxed-out 2009 Macmini running OSX 10.5. This means that Python-scripted image processing is limited or impossible. It is proposed that a dedicated machine running OS X Mountain Lion is purchased specifically to support Faebot's activities (currently the most active bot on Commons with a track record of over 2.2 million edits). This will provide much needed disk capacity and open potential for audio and video processing as well as supporting more complex image processing and identification issues on batch uploaded images. Current price is £499[1] (standard John Lewis price with 2 year warranty).
    • Bandwidth use has been high (capped at 40GB/month). An upgrade to a higher bandwidth service costs an extra £10/month. It is proposed that half of the year's costs are covered by a one-off grant in 2014 to support Faebot's batch processing related activities (£60).
    • WMUK previously agreed to pay for a £15 memory stick to reduce the likelihood of hard-drive damage, though this has yet to be claimed for. Considering that a 32GB stick will probably not be sufficient to cope with a full XML dump from Commons in 2014 (needed for the Geograph project), it is proposed that a 64GB USB stick is purchased, with expected costs around £30.[1]
  • Staff resources. None.
  • Expenses. Limited to postage, perhaps £10, no travel is expected.
  • Access costs. An obstacle to uploading some sets of restricted images (but where a licence release is on record in OTRS) has been that we require membership for some of the forums. The membership cost for Airliners.net (the main resource so far) is $55 for a year; there are options for taking 3 month or 6 month memberships that may be suitable. Around 9 forums are on our target list and a general membership budget of up to £100 may be sufficient, to be spent as and when these purchases will have the most benefit.

The total budget to support the above is estimated at £650.

Constraints

None.

These projects are noted for being both engaging for "gnomic" volunteers and independent of the WMF or WMF-managed tools. This probably remains desirable, even if promotion of the outcomes (the images then available for reuse on all projects) may appeal to "front-end" volunteers. The methods could be popular to present at events such as Wikimania or at more focussed workshops on how to manage large batch projects on Commons, and in the longer term there may be regular maintenance or housekeeping bot tasks that could transition to WMF servers.

Outcomes
  • 100,000 Avionics photographs checked and categorized will provide an independent and non-commercial world standard reference base for aircraft of all model types in all airline liveries. A consequence will be a consistent standard for using ICAO codes on all Airport categories, along with their geo-coordinates and photographs of the majority of airports, military air bases and air fields in all countries.
  • Consensus for the project methods of automatic Geo-categorization of the millions of photographs in the Geograph collection. This has currently been limited to UK County/Authority level due to the doubts about accuracy and a lack of standardization for naming lower level categories such as villages. An automated link using WikiData may be possible in 2014, though this will also require cross-project consensus. This has been an issue without firm consensus for several years.
  • Throughout 2014 a series of published tests, case studies and on-Commons guidelines for:
    • Best practices for using Ordnance Survey Open Data to categorize images on Commons by location.
    • Python and Pywikipediabot techniques for identifying and removing standard watermarks and credit bars from batch uploads. This may include the use of SciPy or similar open source tools to analyse and correct images.
    • Identification of near duplicates and copyright problems through image matching.
    • Using EXIF data for improved categorization and finding suspect images.
  • Ad-hoc outcomes from supporting Faebot. A track record of successful batch upload and bot projects tends to attract suggestions from the Commons Community. As an example User:odder proposed the creation of Commons:User:Noaabot to maintain daily USA weather maps on Commons (which is supported using the same machine as Faebot) and Llywelyn2000 proposed a (Fair Use & PD) Welsh book covers project with a maintained project dashboard on the Welsh wikipedia cy:Wicipedia:Wicibrosiect_Llyfrau_Gwales/dangosfwrdd; these active projects rely on Faebot being available every day of the year.
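As an illustration of the location-based categorization mentioned above, the following is a minimal sketch of turning an Ordnance Survey National Grid reference (the kind attached to Geograph images) into a numeric easting and northing, which could then be compared against boundary data. It is not taken from Faebot's source; the function name is hypothetical, it handles only the two-letter British grid (not the Irish grid), and it does not check that the square actually lies within Great Britain.

```python
def gridref_to_easting_northing(gridref):
    """Convert an OS National Grid reference such as 'SX0454' to metres.

    Returns (easting, northing) of the south-west corner of the square.
    """
    ref = gridref.replace(" ", "").upper()
    letters, digits = ref[:2], ref[2:]
    if len(letters) != 2 or not all("A" <= c <= "Z" and c != "I" for c in letters):
        raise ValueError("expected two grid letters (A-Z, no I): %r" % gridref)
    if len(digits) % 2 or len(digits) > 10 or not all(c in "0123456789" for c in digits):
        raise ValueError("expected an even number of digits (max 10): %r" % gridref)

    # Letter positions in the 25-letter grid alphabet, which omits 'I'.
    l1 = ord(letters[0]) - ord("A")
    l2 = ord(letters[1]) - ord("A")
    if l1 > 7:
        l1 -= 1
    if l2 > 7:
        l2 -= 1

    # First letter picks a 500 km square, second a 100 km square within it.
    e100km = ((l1 - 2) % 5) * 5 + (l2 % 5)
    n100km = (19 - (l1 // 5) * 5) - (l2 // 5)

    # Split the digits in half and pad each half to metre resolution.
    half = len(digits) // 2
    scale = 10 ** (5 - half)
    easting = e100km * 100000 + (int(digits[:half]) * scale if half else 0)
    northing = n100km * 100000 + (int(digits[half:]) * scale if half else 0)
    return easting, northing
```

For example, `gridref_to_easting_northing("SX0454")` returns `(204000, 54000)`, the corner of a 1 km square in Cornwall. Deciding which county or authority a point falls in would still need boundary data, which this sketch does not attempt.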
Risks
  • This project is highly reliant on Fæ's time and Russavia's expertise for avionics. Project pages such as Commons:Batch_uploading/Airliners actively encourage participation and it may be possible to get another bot operator interested in the relevant maintenance scripts that Faebot relies on. However there are no hard deadlines, so temporary illness or unavailability of a volunteer should not affect the long term outcome.
Discussion
  • Would this be better done by an online server (or Wikimedia Labs?) rather than a computer on a home broadband connection? Thanks. Mike Peel (talk) 16:14, 15 November 2013 (UTC)
  • I agree that highly stable daily routines like the non-image processing parts of Noaabot, or the weekly upload from the MOD (with on-going positive cooperation from the Ministry of Defence Imagery Team), could benefit by moving to WMF servers (both Faebot and Noaabot have been recently set up at tools.wmflabs.org). Remember these are not intended to be external facing tools, or additions to MediaWiki. Investigating this and making it happen was something I was planning on looking at in 2014, guessing it will take a month or two to pick away at it, having never done it before. There being little training for any of this, Wikimedia documentation and manuals intended for volunteers remain in a hard to use state, which means I would have to commit quite a proportion of spare time to auto-didactically working it out, diverting from pragmatic delivery. Farming stable pieces of Faebot out to the WMF servers is an expected outcome in 2014, though the successful track record of doing the creative development parts of these projects at the client end says a lot for keeping them a hackish local set-up.
Case: Image processing and experiments: There are benefits to being able to process images and run experiments locally (including tweaking and testing code even when my connection is unreliable, as has happened many times in the past, currently this bug caused by WMF server problems has not stopped Faebot developing). Much of Faebot's activities remain rather ad-hoc and adapt as problems arise, it is the ease of adaptation that has resulted in Faebot becoming the most active bot on Commons. In fact, in terms of image processing, apart from running SciPy based transforms, I find it hard to imagine resolving the embossed watermarks issue in a virtual environment. The process I use for cropping ended up a mix between Windows tools and OSX tools, with other image problems even relying on work-arounds using Photoshop macros and past attempts at video re-compression using a workflow piping the files to external sites and redoing in a local VLC setup. Any of these sorts of fixes would be made so complex going via a virtual environment (or dubious if needing to install odd one-off tools in the remote environment) that I would give up at a planning stage before experimentation started.
Case: Avionics mappings: For the Avionics project each forum is a separate configuration using BeautifulSoup to carefully capture the image metadata, each author has their own template which I have had to fiddle around with to get the mapping right, the airports list extends as we go along, which means Faebot halts, tells me there is a mapping needed and I hand update the category map (if I can find one, otherwise I make a human decision on whether to create it or leave it for the project team) along with my best guess for string matching which may have value for other forums. If we integrate the complex process of detecting and removing embossed watermarks this would be highly experimental, I don't see any value of pushing this out to a remote server.
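The "halt and ask for a mapping" step described above could look something like the sketch below. This is only an illustration of the idea, not Faebot's code: the exception, the function and the sample map entry are all hypothetical names.

```python
class MissingMapping(Exception):
    """Raised when a scraped airport name has no Commons category yet."""


def category_for_airport(raw_name, category_map):
    """Look up the Commons category for an airport name scraped from a forum.

    Whitespace and case are normalised before the lookup. If no mapping
    exists the function raises MissingMapping, so the run stops and the
    operator can update the map by hand rather than guess a category.
    """
    key = " ".join(raw_name.split()).casefold()
    try:
        return category_map[key]
    except KeyError:
        raise MissingMapping(raw_name) from None


# Hypothetical hand-maintained map: normalised forum string -> category name.
airport_categories = {
    "london heathrow": "London Heathrow Airport",
}
```

Calling `category_for_airport("  London   Heathrow ", airport_categories)` returns `"London Heathrow Airport"`, while an unknown name raises `MissingMapping` and leaves the decision to a human.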
Case: Geograph: For Geograph things are more stable with runs for regions (like Wales) taking a month or two to complete, but again I would separate out the uniquely mapped batch uploads, along with their heuristic category mappings, from what might become maintenance tasks for Geograph longer term. The potential next stage of mapping to sub-County level will require significant small-run testing and experimentation, again best done locally where suck-it-and-see along with a lot of invisible passive tests is an existing highly successful approach. Part of the project will be the end transition to maintenance which may well be either on WMF servers, or run elsewhere but with the code published centrally to ensure operator handover can work should the worst happen.
-- (talk) 01:11, 18 November 2013 (UTC)
I don't know about 'an online server (or Wikimedia Labs?)', Mike, but the work Fae's doing uploading book covers on cy is superb! Robin Owain (WMUK) (talk) 01:23, 21 November 2013 (UTC)
Comments from Richard Symonds

I think it might be useful if I provided my opinion here.

  • I am concerned that this may be seen by the community as paying someone to edit Commons.
  • On the other hand, the benefits of Fae's work are staggering and the bot he runs really benefits smaller projects.
  • I am unsure about the educational benefit of lots of pictures of aeroplanes (we probably have enough), but your mileage may vary.
  • Volunteer time for this is quite low - only 10 hours/month - which means large benefits for relatively little effort. Supporting this would presumably decrease volunteer time further, freeing up Fae to work on other projects.
  • Bandwidth costs are very dodgy ground. It would be very difficult to justify paying for bandwidth costs for volunteers regardless of how much work they do. Loaning equipment is one thing, but there would be no way (beyond trust) to ensure that the bandwidth was actually being used for charitable purposes.
  • Video conversions are a laudable goal but this would be best done on the WMF's end (or similar), rather than on a home PC. We should push some weight towards making this happen.
  • The loan of equipment would be possible but I am again unsure that we can justify purchasing a £700 macbook for a volunteer to use. Perhaps a long-term loan agreement would be possible (on the order of three years for the loan?)
  • We can do better than a 64gb USB drive - we can loan a 500gb usb drive from the office without much of a problem.
  • Staff resources would not quite be none.
  • Access costs for forums seems a good use of funds and I would happily pay for this even if the rest was declined.

Ultimately the decision is not mine, but I hope this helps. Richard Symonds (WMUK) (talk) 13:44, 21 November 2013 (UTC)

Points of clarification
  1. This is a compact devoted box costing under £500, refer to the link provided, it is not a "£700 macbook", nor a laptop. It would not replace my every-day desktop nor my notebook/laptop.
  2. The video processing work mentioned includes re-formatting that would never be done at the WMF server end. The WMF servers are unlikely to ever be able to process a wide range of proprietary formats, and may be liable for licence fees should this happen for a small sub-set of popular formats.
  3. Image analysis is not the same thing as changing formats (such as from GIF to PNG as Noaabot does). Intelligently removing, or even detecting, embedded watermarks will never be fully automated and is a complex service that is unlikely to ever be run in the type of virtual environment that WMFlabs offers and even if sub-projects can be run this way, the volunteer overheads of making this work are likely to become impractical.
  4. As in the microgrant for the 32GB USB stick, the basic reliability stats show it is more cost effective to use SSD technology rather than standard hard disks for high volume reading of a Commons XML dump. It is not a constraint of disk space, it is a constraint of reliability for this particular task.
-- (talk) 20:54, 21 November 2013 (UTC)
On the point about a long-term loan: I'm not sure that makes sense, since the Finance Policy states that "The value of any such asset will be depreciated on a three-year, straight line basis." So if it were loaned for 3 years, the finance policy would value it at nothing at the end of the loan, which seems a bit pointless. That said, any equipment purchased under a micro/macrogrant should remain the property of WMUK, so perhaps it's a non-issue. Thanks. Mike Peel (talk) 05:47, 24 November 2013 (UTC)
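The three-year, straight-line rule quoted above can be worked through with a short sketch. It assumes the £499 purchase price given in the proposal; the function name and the rounding to pence are illustrative choices, not part of the Finance Policy.

```python
def straight_line_value(cost, years_elapsed, life_years=3):
    """Book value of an asset depreciated evenly over life_years."""
    remaining = max(0.0, 1 - years_elapsed / life_years)
    return round(cost * remaining, 2)


# Book value of a £499 machine at purchase and after 1, 2 and 3 years.
values = [straight_line_value(499, years) for years in range(4)]
```

On these assumptions `values` is `[499.0, 332.67, 166.33, 0.0]`, which is the point being made: at the end of a three-year loan the asset is carried at nothing.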
Comments from Chris Keating

This looks like a great proposal. I would be very happy for us to support this with one caveat and one question. The caveat is that the computer provided would need to remain the property of Wikimedia UK (so that it would be available for future projects when this one is complete, or if this one is aborted early), as is standard in these situations. The question is regarding the bandwidth element. I don't have a problem in principle with requests for funding bandwidth of this nature as it is clearly a project that involves well in excess of the normal home broadband bandwidth usage, and the amount requested takes into consideration the fact that the additional bandwidth may not entirely be used for the purposes of this project. But I would ask - how substantial is the bandwidth constraint on the effectiveness of the project? Many thanks to Fae for putting this forward. Regards, The Land (talk) 14:21, 24 November 2013 (UTC)

I am not sure I understand the question. I can give a little background on how my home telecoms set-up has been changed to deliver these projects. At the beginning of the year I hit my maximum bandwidth allowance with my current provider (I think "fair use" was defined as peaking at 40GB/month, but the providers are cagey about defining this) entirely due to my work on Commons, resulting in my usage being significantly throttled. After a bit of negotiation I moved to an unlimited deal on fibre-optic. I estimate my Commons related uploads/downloads to be more than 90% of my bandwidth use comparing my provider's log and the wmflabs uploadsum tool (I think my non-project usage in a month hovers under 4GB). When reading these numbers, account should be taken of the fact that upload sizes must mean at least equivalent download sizes, and that bandwidth may be eaten up by background tasks such as checking file hashes from external sites, and may mean downloading a prospective file and then discarding it after duplicate checks or examining the EXIF data. Looking at my account for the first 3 weeks of November, the live data shows me that I am just at 50GB, so will probably reach over 65GB this month, even though I have paused my Airliners uploads in the past week being on a wikibreak to handle other stuff.
What is proposed is a nominal rather than realistic contribution to these running costs, which effectively covers not just bandwidth but also electricity for running this fairly efficient compact machine full time (which I have not estimated, but Googling other standard measures puts this at over £5/month, which could be an alternative justification for £60/year).
In terms of effectiveness, if I went back to a "limited" telecoms deal, then I would probably have to find a way to choke the projects to safely staying at under 20GB/month, so around 1/3 of what is done today. -- (talk) 16:50, 24 November 2013 (UTC)
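As a rough check on the figures above, a simple linear extrapolation can be sketched as follows. It assumes usage is spread evenly across the month, which overstates things slightly here since the Airliners uploads were paused for a week; the function name is illustrative.

```python
def project_monthly_usage(used_gb, days_elapsed, days_in_month):
    """Extrapolate month-end usage from usage so far, assuming a steady rate."""
    return used_gb / days_elapsed * days_in_month


# 50GB used in the first 3 weeks (21 days) of a 30-day November.
november_estimate = round(project_monthly_usage(50, 21, 30), 1)

# A 20GB/month cap compared with the 65GB expected for the month.
capped_fraction = round(20 / 65, 2)
```

This gives `november_estimate` as 71.4 (GB), consistent with "over 65GB" once the paused week is allowed for, and `capped_fraction` as 0.31, in line with the "around 1/3" figure.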
That does answer my (inelegantly phrased) question, thanks. Happy to support this. The Land (talk) 19:46, 25 November 2013 (UTC)
Comments from Michael Maggs

I would hope that WMUK will approve this. The bot work that Fae has been doing to improve Commons content over an extended period is exceptional, and if he has now reached a limit on what can be achieved with his own hardware it makes perfect sense for WMUK to provide financial support to help lift that limitation. No doubt it would be unusual for us to provide someone with a computer for sole use at home, but we have to balance the cost to the charity against the potential benefits to be achieved, and here the potential benefits are so high this looks like a pretty effective use of funds. Normal practice would be for the charity to purchase the hardware (computer plus external drive if needed) and then to provide it on loan for an extended period, to be returned at the end of that time, or earlier if called for or if the project ceases before then. As a registered charity we are legally allowed to provide funding solely for charitable purposes, but I do not see that as an issue as Fae has said that the hardware will be dedicated to the project. We could not of course pay for all home bandwidth costs, as some of those costs would be attributable to personal (non charitable) uses, but I see no reason why we should not agree to pay a suitable proportion while the project is ongoing. If Fae makes a realistic estimate of the proportion, and could agree to update that estimate regularly, I would be perfectly happy to proceed on that basis. For such a long term and productive editor there is absolutely no need for us to insist on irrevocable proof that every single bit of bandwidth has been used for charitable purposes. Evidence that the funds are being spent reasonably would come naturally from Fae's project reports detailing numbers of images (and MB) uploaded and so on, and that should be fine. Fae, could you agree to provide six-monthly reports, say, in some agreed format? --MichaelMaggs (talk) 15:36, 25 November 2013 (UTC)

6-monthly seems a sensible way to avoid worrying about explaining routine peaks and troughs (like switching everything off when on holiday, or pausing everything for a few days if there is a server problem). The outcomes are logged as part of the project pages and so long as the WMF keeps their tools available, there are auditable historic upload reports around to reuse. I have not yet got into using SQL to make my own reports from the main database (a process I hope to work out next year), but that might be a way of automatically producing a credible specific quarterly report once set up. My telecoms provider gives the current month bandwidth use as an embedded online graphic, and does not appear to track it longer term as part of billing, I am unsure if it would be worth the hassle of tracking it unless there were specific questions to look at. -- (talk) 17:32, 25 November 2013 (UTC)

Approval

Fæ, based on the comments here by members of the community, the Chief Executive is happy to approve this application on the following basis:

  • Purchase of the proposed Mac Mini to be used as proposed.
  • One-off grant of £60 towards the cost of bandwidth use for the projects.
  • Purchase of an appropriate 64GB SDXC card to be used with the Mac Mini. As a result, the previously approved application of a 32GB USB stick will be considered void.
  • Budget of up to £100 for relevant forum subscriptions, to be used when appropriate. Please report spending on this item separately, at the time it is made. A couple of sentences on this page about which forum it is, and why it was selected, will be sufficient.
  • Per standard practice, the Mac Mini & SDXC will remain property of Wikimedia UK, to be returned when the projects are complete or aborted, or earlier under exceptional circumstances.
  • Publication of Faebot source code under a suitable open source licence so that other people can benefit from it and/or suggest improvements.
  • If the use of the purchased equipment moves into the uploading of content that is out of scope of the original application, this must be discussed with Wikimedia UK beforehand.
  • Half-yearly reports of the projects' progress, to feature a breakdown of information including, but not limited to, the number of new images uploaded, the number of existing images updated with higher resolution, the number of unique images checked and categorised, etc. Reports should otherwise be made in the format as shown on Microgrants/Template/Report.
  • Any media files uploaded (either new or as higher resolution version) as part of the projects to be tagged with {{Supported by Wikimedia UK|year=XXXX}}, with "XXXX" to be substituted with the year the upload is taking place. (The year parameter is not currently part of the linked to template, but it is envisaged that some form of it will be.)

Thank you for your application. I look forward to seeing continued successful outcomes from the projects on Commons. Please let me know whether you would like the purchase to be made by the office and posted to you, or whether you will make the purchase yourself and claim the expenses afterwards. -- Katie Chan (WMUK) (talk) 11:37, 16 December 2013 (UTC)

Thanks for the update here. When the new machine is up and running, I plan to add a declaration on Commons, on my user page there, and make a note about it on the related project pages. Due to non-Wikimedia related urgent stuff, I am likely to remain on a wikibreak until the beginning of January and will not start considering how to sort out purchases until then. -- (talk) 16:13, 16 December 2013 (UTC)

Further clarifications
  • I purchased a USB stick in compliance with Microgrants/32GB usb stick for Commons dump, I still intend to submit a claim for it and it has been in use as proposed on that Microgrant. I doubt that the intention here is to withdraw grants retrospectively after the money has been spent by a volunteer, and after Wikimedia has benefited from the associated content generation. -- (talk) 15:54, 16 December 2013 (UTC)
  • No, it wasn't. I thought you meant from your comment above that you were still to make the purchase. I have struck out that part of my earlier statement. Regards -- Katie Chan (WMUK) (talk) 15:57, 16 December 2013 (UTC)
Thanks for prompt confirmation. -- (talk) 16:13, 16 December 2013 (UTC)
Hi Fae, could you please send in a claim ASAP (by email, by post (FREEPOST WIKIPEDIA), whichever is easiest)? The 32GB USB stick grant happened over a year ago, and while I know you've been busy, it will not make our accountant happy. I also don't want to make a habit of paying year-old expense claims, as it's not good practice. Please forward the receipt to Katie as soon as you can - certainly before 31 January (our year end). Richard Symonds (WMUK) (talk) 16:25, 16 December 2013 (UTC)
I can look at this when I am back in London in January, I have the receipt sitting on a bill-hook. I believe there are some outstanding claimed expenses that were never paid, but as I do not get any positive notification of payment, I have been putting off the time needed to back track over several months of statements to check the non-payments. -- (talk) 16:47, 16 December 2013 (UTC)
Thanks! Richard Symonds (WMUK) (talk) 16:48, 16 December 2013 (UTC)

Good. I'm glad the CE has agreed to this application. It sounds an excellent use of the charity's funds. --MichaelMaggs (talk) 17:36, 16 December 2013 (UTC)

Progress update

Hi Fæ, it's been 6 months since the purchase of the Macmini, and I see there's been great work so far. Thank you! Per the approval condition above: "Publication of Faebot source code under a suitable open source licence so that other people can benefit from it and/or suggest improvements." Can you please point me to where the source code is? Thanks -- Katie Chan (WMUK) (talk) 12:10, 10 June 2014 (BST)

I'll add relevant links to WMUK report. The Avionics source has been added here. Russavia has been keen for me to arrange membership of the Russianplanes forum, which is covered by this proposal, and should result in the order of 100,000 further photographs later this year. I shall email you separately. -- (talk) 13:28, 10 June 2014 (BST)

Footnotes

  1. Given that the Macmini now has an integrated SDXC card port, a memory card of this type may be a better purchase for the same budget, and perhaps with the same capacity again depending on make. This would effectively become a permanent internal static disc for caching XML dumps from the projects, or collections of media files undergoing processing, as needed.