Project grants/Pronunciations of words for Wiktionary

From Wikimedia UK

Basic information

Project Title (If applicable)
Microphone to record pronunciations of words for Wiktionary
Proposed by
Martin Michlmayr (Wikimedia user MartinMichlmayr)
Are you currently a member of Wikimedia UK?
No (I read the requirements; it seems this is not enforced but I can join if it's a problem)

Project description

Briefly describe the issue or problem that motivates this application. What needs are you meeting?
Wiktionary is a dictionary that contains many words in different languages. While Wiktionary explains the meaning of words, it's also important to know how they are correctly pronounced. When an audio file for a word is available on Wikimedia Commons, the pronunciation can be linked on Wiktionary. This greatly helps users of Wiktionary.
Sharing these pronunciations freely on Wikimedia Commons also opens up many other uses. For example, pronunciations can be used in Anki vocabulary decks. (Anki is a spaced repetition system that helps people memorise material such as vocabulary.)
The proposed budget is:
Description                    Cost
Rode NT-USB Mini               £85
Pop filter                     ~£20
Boom arm (stand)               ~£20
Refreshments for volunteers    £40
Total                          £165
I am currently based in the Philippines. I'm applying to Wikimedia UK because the project fits directly into its mission and there are no grants from the Wikimedia Foundation for this kind of request. I used to be a UK resident.
Describe project activities. What will you use the funding to do?
Together with a Kenyan friend, I recorded several thousand pronunciations of Swahili words. I would like to complete this project and record several thousand more words, ideally covering the whole Swahili dictionary.
So far I used a pretty standard microphone. While the quality is okay(ish), I think a better microphone would lead to much better quality. It would also reduce the amount of work since we had to throw away a lot of recordings due to noise issues.
Hopefully, I can also use this microphone to record pronunciations for other languages in the future. I have the process of recording pronunciations figured out fairly well (recording, splitting files, uploading, etc).
Proof of work so far: https://commons.wikimedia.org/wiki/Category:Audio_recordings_by_Waithera_Were
Describe your plan for evaluating this project. How will you measure success? What types of things will you measure (e.g. content, participants)?
In terms of content, the project will be a success if we upload several thousand Swahili pronunciations.
A stretch goal is to upload pronunciations for other languages. I'd like to tackle Cebuano and Tagalog (two languages spoken in the Philippines), but that may not be possible.
Another stretch goal is to record Swahili sample sentences for the Swahili primer on Wikibooks: https://en.wikibooks.org/wiki/Swahili
Identify key people involved in this project. How will or could the wider Wikimedian community be involved?
The work will primarily be done by me and initially with my Kenyan friend. I hope to recruit other volunteers who can help with other languages.
If applicable, identify partnering organisations for this project (not essential)
N/A

If you feel that there is more information that could be useful, please add it here: for example, resources needed, how successes can be measured, and how the project fits in with the aims of Wikimedia and Wikimedia UK. Please note that these answers don't have to be definite now, and can be expanded on in conversation with the programme team.

What targets have you set? What will you measure?
The target is to cover the whole dictionary (including words currently missing from Wiktionary). I am working on generating a word list based on various resources.
What contribution will the project make to our strategic goals?
This project directly contributes to open access to knowledge. Specifically, Swahili is a popular language (150+ million speakers) for which very few resources are available. (And the resources that are available are not open knowledge.)
Who will be recording/measuring the project metrics, and writing up a project report?
Martin Michlmayr
What staff support is being requested?
None
How can you get other volunteers involved? What roles could they have?
I intend to recruit other volunteers who can record words in their native languages.
What meeting or other space is needed?
None
Are other resources needed (such as computers, books, camera equipment, food, contacts, infrastructure)? How will they be sourced?
If this grant is approved, I will buy a microphone, pop filter and microphone stand (boom arm). I have all other equipment (computer) and tools (script to upload to Wikimedia).
We could add £40 to buy pizza and drinks for volunteers. After all, it's hard to speak with a dry mouth ;). But this isn't strictly necessary. (Added to budget: feel free to remove)
If any partner organisations have been identified, have they been contacted and are they committed?
N/A
Does this project require more extensive funding? What would any WMUK funds be used for?
No
Are external funds needed that we can apply for? If so where will they be sought?
No
Are there any resources that you can contribute? Such as equipment.
N/A


Talk:Project grants/Pronunciations of words for Wiktionary

Project updates

May 2022

Although I applied for this grant back in December 2021, the project only started in April 2022. Because of this, this first progress report will mostly focus on problems, challenges and observations rather than on outcomes. However, rapid progress is being made.

Audio quality

We had previously recorded over 8,000 Swahili pronunciations. While the quality was okay, I wasn't entirely happy with the result, which was the reason for applying for this project grant. After looking at many reviews, I decided on the RØDE NT-USB Mini.

Despite the new microphone, I was initially still not entirely happy with the result. After playing around some more and trying a number of things, I settled on a process that leads to high quality recordings. (Adding a pop filter, removing sources of noise in the environment, doing noise reduction in Audacity, etc.)

Another issue I ran into is that there are no best practices for Wikimedia Commons pronunciation files. For example, in my old recordings I added a 300ms pause at the beginning of files. Now I believe that such a pause is too long. Unfortunately, there is no guidance on this question, or related questions (such as what volume should be used).
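As an illustration of how a detail like the leading pause could be checked consistently across files, here is a minimal sketch that measures the quiet lead-in of a recording. It is not part of the process described above (which uses Audacity); the function name, the 16-bit PCM WAV input and the 2% amplitude threshold are all assumptions made for the example.

```python
import array
import sys
import wave


def leading_silence_ms(path, threshold=0.02):
    """Return the length of the quiet lead-in of a 16-bit PCM WAV file, in ms.

    `threshold` is an assumed fraction of full scale below which a sample
    counts as silence; a sensible value depends on the recording setup.
    """
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        channels = wav.getnchannels()
        rate = wav.getframerate()
        samples = array.array("h", wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    limit = threshold * 32767
    for index, value in enumerate(samples):
        if abs(value) > limit:
            # Convert the interleaved sample index to a frame index, then to ms.
            return 1000.0 * (index // channels) / rate
    # No sample exceeded the threshold: the whole file counts as silence.
    return 1000.0 * (len(samples) // channels) / rate
```

Running such a check over a batch of files would show whether the pause is uniform, whatever length is eventually agreed on.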

There are many differences between pronunciation files on Wikimedia Commons, not just in terms of quality, but in those little details that could easily be standardised. (The same goes for how pages on Commons, and the structured data for such files, should be organised.) I intend to start some discussions on these topics on Wikimedia Commons, but it seems there's currently little discussion there about pronunciation files.

An overall question I often struggle with is: what is good enough? The audio quality is pretty good now, but it's definitely not perfect. A better pop filter, a better environment (e.g. a recording studio), etc, would lead to even better results. Same with the questions about best practices. It's of course important to strive for high quality, but things are never going to be perfect and it's important not to get bogged down by every little detail.

However, I struggle with this question in relation to the previous audio recordings we did. We already uploaded 8,000 files. The quality of the new recordings is much better. Would it be worth redoing the previous recordings? There are two good arguments in favour:

  • The initial 8,000 recordings cover the most important words and these should have the best quality
  • Having the same consistency and quality for all recordings would be nice

On the other hand, doing 8,000 words again is a lot of effort. My Kenyan friend, who does the recordings, is very patient but I'm sure there's a limit. Editing the files and doing QA also takes substantial time.

Finding words

The lofty goal of this project was to "cover the whole dictionary" (for Swahili), but what does that mean in practice?

First, what is a good source of words? I used Wiktionary as the starting point, as I want to create pronunciation files that can be used on Wiktionary. Wiktionary, of course, is not complete. But what I noticed during this work is that various word forms for words that are on Wiktionary are missing.

For example, there are around 4,100 Swahili nouns, but only 1,501 plural nouns. While you can't expect the number of plurals to match the number of singular nouns exactly (as some nouns don't have a plural and some plural forms are actually the same as the singular), this shows that a lot of pages are missing on Wiktionary.
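To put rough numbers on this gap, here is a small sketch. It assumes that the figure of around 4,100 counts singular (lemma) entries and that each of them could in principle have its own plural page, which overstates the gap for the reasons given above. The function name is made up for the example.

```python
def plural_gap(singular_count, plural_count):
    """Return (coverage, missing) for plural pages relative to singular nouns.

    `missing` is only an upper bound: nouns without a plural, or with a
    plural identical to the singular, do not need a separate page.
    """
    coverage = plural_count / singular_count
    missing = max(singular_count - plural_count, 0)
    return coverage, missing


# Figures taken from the paragraph above (4,100 is approximate).
coverage, missing = plural_gap(4100, 1501)
print(f"plural coverage: {coverage:.1%}, missing at most: {missing}")
```

On these figures, only a little over a third of the nouns have a plural page, and up to roughly 2,600 pages could be missing.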

There are bots on Wiktionary that create such forms automatically from the main entry, but there's no such bot for Swahili. This is something I hope to work on.

Another problem is that many words are missing from Wiktionary altogether. I have identified a number of other sources for Swahili words (which hopefully will also be added to Wiktionary later, at which point a pronunciation file will already be available).

In other words, the project to record audio has led to several related projects to find words and to improve Wiktionary.

Second, what does the "whole dictionary" mean? For example, let's take the English word "walk". It has many variants, such as "walks", "walking", and "walked". Is it reasonable (and even necessary) to record all of these variants? Swahili has a lot of variants and I don't think it's possible to record all of them.

My main focus right now is on:

  • Verbs: stems and infinitives
  • Nouns: singular and plural
  • Other lemma forms
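For the verb part of this list, a recording word list could be derived from stems with a very simple rule. The sketch below is an assumption for illustration, not the script actually used: it takes stems written with a leading hyphen (as in "-soma"), forms the infinitive by prefixing "ku", and keeps a small hand-maintained table for verbs where that naive rule gives the wrong form. Every generated form would still need checking by a speaker.

```python
# Hypothetical exceptions table: stems whose infinitive is not simply "ku" + stem.
IRREGULAR_INFINITIVES = {
    "enda": "kwenda",
    "isha": "kwisha",
}


def infinitive(stem):
    """Return the infinitive for a verb stem such as "-soma" (naive rule)."""
    bare = stem.lstrip("-")
    return IRREGULAR_INFINITIVES.get(bare, "ku" + bare)


def recording_candidates(stems):
    """Return each stem followed by its infinitive, without duplicates."""
    seen = set()
    candidates = []
    for stem in stems:
        for form in (stem.lstrip("-"), infinitive(stem)):
            if form not in seen:
                seen.add(form)
                candidates.append(form)
    return candidates


print(recording_candidates(["-soma", "-enda", "-la"]))
```

Noun plurals are deliberately left out of this sketch, since they depend on noun class and cannot be derived from the singular by a single prefix rule.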

Accomplishments

I've refined the recording process and I'm now happy with the audio quality. There are still some open questions in terms of best practices that I'd like to resolve before uploading the files.

We recorded over 1,700 new pronunciations. These consist of ~580 lemmas that have been added to Wiktionary in the last two years and ~1,300 plural nouns.

Next steps

  • Create pages on Wiktionary (plural nouns, infinitive verbs, etc)
  • Find other sources for Swahili words
  • Record more words
  • Resolve outstanding questions on best practices for pronunciation files and upload to Wikimedia Commons
  • Document my audio recording process in more detail
