Project grants/Pronunciations of words for Wiktionary

From Wikimedia UK

Basic information

Project Title (If applicable)
Microphone to record pronunciations of words for Wiktionary
Proposed by
Martin Michlmayr (Wikimedia user MartinMichlmayr)
Are you currently a member of Wikimedia UK?
No (I read the requirements; it seems this is not enforced but I can join if it's a problem)

Project description

Briefly describe the issue or problem that motivates this application. What needs are you meeting?
Wiktionary is a dictionary that contains many words in different languages. While Wiktionary explains the meaning of words, it's also important to know how they are correctly pronounced. When an audio file for a word is available on Wikimedia Commons, the pronunciation can be linked on Wiktionary. This greatly helps users of Wiktionary.
Sharing these pronunciations on Wikimedia Commons freely also offers many other use cases. For example, pronunciations can be used in Anki vocabulary decks. (Anki is a spaced repetition system that allows people to memorise certain things.)
The proposed budget is:
Description                   Cost
RØDE NT-USB Mini              £85
Pop filter                    ~£20
Boom arm (stand)              ~£20
Refreshments for volunteers   £40
Total                         £165
I am currently based in the Philippines. I'm applying to Wikimedia UK as it directly fits into the mission and there are no grants from Wikimedia Foundation for this request. I used to be a UK resident.
Describe project activities. What will you use the funding to do?
Together with a Kenyan friend, I recorded several thousand pronunciations of Swahili words. I would like to complete this project and record several thousand more words, ideally covering the whole Swahili dictionary.
So far I used a pretty standard microphone. While the quality is okay(ish), I think a better microphone would lead to much better quality. It would also reduce the amount of work since we had to throw away a lot of recordings due to noise issues.
Hopefully, I can also use this microphone to record pronunciations for other languages in the future. I have the process of recording pronunciations figured out fairly well (recording, splitting files, uploading, etc).
Proof of work so far: https://commons.wikimedia.org/wiki/Category:Audio_recordings_by_Waithera_Were
Describe your plan for evaluating this project. How will you measure success? What types of things will you measure (e.g. content, participants)?
In terms of content, the project will be a success if we upload several thousand Swahili pronunciations.
A stretch goal is to upload pronunciations for other languages. I'd like to tackle Cebuano and Tagalog (two languages spoken in the Philippines), but that may not be possible.
Another stretch goal is to record Swahili sample sentences for the Swahili primer on Wikibooks: https://en.wikibooks.org/wiki/Swahili
Identify key people involved in this project. How will or could the wider Wikimedian community be involved?
The work will primarily be done by me and initially with my Kenyan friend. I hope to recruit other volunteers who can help with other languages.
If applicable, identify partnering organisations for this project (not essential)
N/A

If you feel that there is more information that could be useful, please add it here: for example, resources needed, how successes can be measured, and how the project fits in with the aims of Wikimedia and Wikimedia UK. Please note that these answers don't have to be definite now, and can be expanded on in conversation with the programme team.

What targets have you set? What will you measure?
The target is to cover the whole dictionary (including words currently missing from Wiktionary). I am working on generating a word list based on various resources.
What contribution will the project make to our strategic goals?
This project directly contributes to open access to knowledge. Specifically, Swahili is a popular language (150+ million speakers) for which very few resources are available. (And the resources which are available are not open knowledge)
Who will be recording/measuring the project metrics, and writing up a project report?
Martin Michlmayr
What staff support is being requested?
None
How can you get other volunteers involved? What roles could they have?
I intend to recruit other volunteers who can do recordings of words in their native languages
What meeting or other space is needed?
None
Are other resources needed (such as computers, books, camera equipment, food, contacts, infrastructure)? How will they be sourced?
If this grant is approved, I will buy a microphone, pop filter and microphone stand (boom arm). I have all other equipment (computer) and tools (script to upload to Wikimedia).
We could add £40 to buy pizza and drinks for volunteers. After all, it's hard to speak with a dry mouth ;). But this isn't strictly necessary. (Added to budget: feel free to remove)
If any partner organisations have been identified, have they been contacted and are they committed?
N/A
Does this project require more extensive funding? What would any WMUK funds be used for?
No
Are external funds needed that we can apply for? If so where will they be sought?
No
Are there any resources that you can contribute? Such as equipment.
N/A


Talk:Project grants/Pronunciations of words for Wiktionary

Project updates

May 2022

Although I applied for this grant back in December 2021, the project only started in April 2022. Because of this, this first progress report will mostly focus on problems, challenges and observations rather than on outcomes. However, rapid progress is being made.

Audio quality

We had previously recorded over 8,000 Swahili pronunciations. While the quality was okay, I wasn't entirely happy with the result, which was the reason for applying for this project grant. After looking at many reviews, I decided on the RØDE NT-USB Mini.

Despite the new microphone, I was initially still not entirely happy with the result. After playing around some more and trying a number of things, I settled on a process that leads to high quality recordings. (Adding a pop filter, removing sources of noise in the environment, doing noise reduction in Audacity, etc.)

Another issue I ran into is that there are no best practices for Wikimedia Commons pronunciation files. For example, in my old recordings I added a 300ms pause at the beginning of files. Now I believe that such a pause is too long. Unfortunately, there is no guidance on this question, or related questions (such as what volume should be used).

There are many differences between pronunciation files on Wikimedia Commons, not just in terms of quality, but in those little details that could easily be standardised. (The same goes for how the pages for such files on Commons should be structured and what structured data they should carry.) I intend to start some discussions on these topics on Wikimedia Commons, although there seems to be little discussion regarding pronunciation files there.

An overall question I often struggle with is: what is good enough? The audio quality is pretty good now, but it's definitely not perfect. A better pop filter, a better environment (e.g. a recording studio), etc, would lead to even better results. Same with the questions about best practices. It's of course important to strive for high quality, but things are never going to be perfect and it's important not to get bogged down by every little detail.

However, I struggle with this question in relation to the previous audio recordings we did. We already uploaded 8,000 files. The quality of the new recordings is much better. Would it be worth redoing the previous recordings? There are two good arguments in favour:

  • The initial 8,000 recordings cover the most important words and these should have the best quality
  • Having the same consistency and quality for all recordings would be nice

On the other hand, doing 8,000 words again is a lot of effort. My Kenyan friend, who does the recordings, is very patient but I'm sure there's a limit. Editing the files and doing QA also takes substantial time.

Finding words

The lofty goal of this project was to "cover the whole dictionary" (for Swahili), but what does that mean in practice?

First, what is a good source of words? I used Wiktionary as the starting point, as I want to create pronunciation files that can be used on Wiktionary. Wiktionary, of course, is not complete. But what I noticed during this work is that various word forms for words that are on Wiktionary are missing.

For example, there are around 4,100 Swahili nouns, but only 1,501 plural nouns. While you can't expect the number of plurals to match the number of singular nouns exactly (as some nouns don't have a plural and some plural forms are actually the same as the singular), this shows that a lot of pages are missing on Wiktionary.

There are bots on Wiktionary that create such forms automatically from the main entry, but there's no such bot for Swahili. This is something I hope to work on.

Another problem is that many words are missing from Wiktionary entirely. I have identified a number of other sources for Swahili words (hopefully these words will also be added to Wiktionary later, at which point a pronunciation file will already exist).

In other words, the project to record audio has led to several related projects to find words and to improve Wiktionary.

Second, what does the "whole dictionary" mean? For example, let's take the English word "walk". It has many variants, such as "walks", "walking", and "walked". Is it reasonable (and even necessary) to record all of these variants? Swahili has a lot of variants and I don't think it's possible to record all of them.

My main focus right now is on:

  • Verbs: stems and infinitives
  • Nouns: singular and plural
  • Other lemma forms

Accomplishments

I've refined the recording process and I'm now happy with the audio quality. There are still some open questions in terms of best practices that I'd like to resolve before uploading the files.

We recorded over 1,700 new pronunciations. These consist of ~580 lemmas that have been added to Wiktionary in the last two years and ~1,300 plural nouns.

Next steps

  • Create pages on Wiktionary (plural nouns, infinitive verbs, etc)
  • Find other sources for Swahili words
  • Record more words
  • Resolve outstanding questions on best practices for pronunciation files and upload to Wikimedia Commons
  • Document my audio recording process in more detail

November 2022

After a long hiatus, I finally had time to work on this again with my Kenyan friend.

In the last few months, I did some prep work for more audio recordings. I went through the Swahili verbs on Wiktionary and created around 175 infinitive forms that were missing.

Furthermore, I identified some sources for words to record that are currently missing on Wiktionary. I found around 7,000 words, but this needs some more work (such as validation).

The focus was on:

  • Verbs: infinitives
  • Nouns: plurals
  • Words (lemmas) that have been added to Wiktionary since the last recording

We were able to complete the recording of these words.

Recording process

In general, I'm happy with the quality of the recordings.

I'm still trying to get the process right. In particular, I struggle with the level of editing that makes sense. There are some aspects that are easy (with the Audacity audio tool), such as noise reduction and splitting the recording into several files.

However, I spend considerable time going through the recording to remove noise. Such noise can be clicking of lips before speaking or accidentally touching the microphone or table, or minor background noise.

This editing is very time consuming and is the most frustrating part of this work. Honestly, I'm not sure it's actually worth the effort. It might be easier to just throw some words away during QA and record them again rather than doing all of this editing.
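The splitting step mentioned above is done in Audacity in this project. Purely as an illustration of the underlying idea, the following is a minimal sketch that finds the non-silent stretches in a sequence of audio samples. The function name `split_on_silence`, its parameters, and the threshold values are all hypothetical and are not taken from Audacity or from the actual workflow.

```python
def split_on_silence(samples, threshold, min_silence, padding=0):
    """Return (start, end) index pairs of non-silent segments.

    samples:     sequence of amplitudes (ints or floats)
    threshold:   absolute amplitude at or below which a sample is silent
    min_silence: number of consecutive silent samples that ends a segment
    padding:     samples of context kept before and after each segment
    """
    segments = []
    start = None     # index where the current segment began
    silent_run = 0   # consecutive silent samples seen inside a segment

    for i, sample in enumerate(samples):
        if abs(sample) > threshold:
            if start is None:
                start = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_silence:
                # The segment ends one past the last loud sample.
                segments.append((start, i - silent_run + 1))
                start = None
                silent_run = 0

    # Close a segment that is still open at the end of the input.
    if start is not None:
        segments.append((start, len(samples) - silent_run))

    n = len(samples)
    return [(max(0, a - padding), min(n, b + padding)) for a, b in segments]


# Two "words" separated by a pause (toy data, not a real recording).
print(split_on_silence([0, 0, 5, 6, 0, 0, 0, 7, 8, 9, 0],
                       threshold=1, min_silence=2))
# prints [(2, 4), (7, 10)]
```

The `padding` argument controls how much silence is kept around each word. Assuming a sample rate of 44,100 Hz, a 300 ms lead-in would correspond to 13,230 samples. Padded segments can overlap if words are spoken close together, which a real tool would need to handle.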

The following table shows the time (in minutes) spent on each task for a number of different recording batches:

Recording  Editing  Listening  QA  Total time  # of words  Minutes per 100 words
18         34       12         7   71          311         23
16         12       13         5   46          274         17
15         14       13         6   48          274         18
16         9        11         7   43          274         16
18         22       13         7   60          306         20
18         9        10         7   44          300         15
18         9        13         7   47          300         16
18         10       13         7   48          300         16
19         16       12         7   54          303         18
18         7        18         7   50          290         17

Most notable is the first batch, where I spent a lot of time editing. Reducing this time makes a huge difference to the total. I've been trying to reduce the time spent on editing, but I still sometimes spend too much time on it.
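For reference, the last two columns of the table can be derived from the others. The sketch below recomputes the total time and the minutes per 100 words for each batch; the figures are copied from the table, while the function and variable names are only illustrative.

```python
# Per-batch figures copied from the table above:
# (recording, editing, listening, qa, number_of_words), times in minutes.
BATCHES = [
    (18, 34, 12, 7, 311),
    (16, 12, 13, 5, 274),
    (15, 14, 13, 6, 274),
    (16, 9, 11, 7, 274),
    (18, 22, 13, 7, 306),
    (18, 9, 10, 7, 300),
    (18, 9, 13, 7, 300),
    (18, 10, 13, 7, 300),
    (19, 16, 12, 7, 303),
    (18, 7, 18, 7, 290),
]


def minutes_per_100_words(recording, editing, listening, qa, words):
    """Return (total minutes, minutes per 100 words) for one batch."""
    total = recording + editing + listening + qa
    return total, total / words * 100


for batch in BATCHES:
    total, rate = minutes_per_100_words(*batch)
    print(f"total={total:3d} min  words={batch[4]}  per 100 words={rate:.1f}")
```

Comparing the first and fourth batches shows the effect of editing time: editing alone accounts for 34 of 71 minutes in the first batch but only 9 of 43 minutes in the fourth.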

Accomplishments

  • Completed the processing of the 1,700 recordings from May
  • Recorded roughly 350 noun plurals (for a total of 1,650 with the 1,300 done in May)
  • Recorded roughly 1,500 verb infinitives
  • Recorded roughly 150 new lemmas
  • Uploaded everything to Wikimedia Commons (roughly 3,700 audio files)

Next steps

  • Identify and validate other sources for words
  • Record more words
  • Look at Lingua Libre
  • Document my audio recording process in more detail


January 2023

The focus in the last two months has been on:

  • Adding words from other sources
  • Re-recording old recordings
  • Recording words with multiple pronunciations

Adding words from other sources

While Wiktionary has some Swahili words, the dictionary is quite incomplete. I've found a number of sources for words that should also be recorded. These words will, hopefully, be added to Wiktionary in the future, and so it's good to have the pronunciation.

While we recorded many new words, we also ran into an unexpected problem: these words are often quite rare and my Kenyan friend sometimes doesn't know how to pronounce them. We've therefore skipped some words. My original goal to record the whole dictionary may not be attainable.

Re-recording old recordings

One question I raised previously was what to do about the recordings we did before obtaining the RØDE NT-USB Mini microphone.

These are 8,000 recordings, so it would be a lot of effort to record them again with the new microphone. Listening to all of them and deciding which to re-record is also a lot of effort.

In the end, we decided to record all of them again: while some recordings are good, a lot of them have noise (for example, compare the recordings for the word amba). Also, when we started, we recorded the most important words and therefore the quality of these recordings matters in particular.

Out of the 8,000 original recordings, we have replaced 2,500 recordings and we recorded another 2,700 that currently await processing and uploading.

I updated my upload script to deal with the replacement of existing files (also updating the meta-data).

Recording words with multiple pronunciations

While the pronunciation of Swahili is pretty regular, there are some words (usually of Arabic origin) that can have multiple pronunciations.

We identified three of them and uploaded both variants (barabara, mujibu, and wajibu) but there are probably others.

Accomplishments

  • Recorded 2,760 new words (including many words not currently found on Wiktionary)
  • Re-recorded 5,200 words to replace the original recordings (2,700 still require processing)
  • Updated upload script to replace existing files and update meta-data

Next steps

  • Finish the re-recording of the original recordings
  • Document my audio recording process in more detail


April 2023

The focus in the last few months has been on:

  • Re-recording old recordings
  • Adding words from other sources

Re-recording old recordings

We recorded 8,278 words with a fairly mediocre microphone before getting the RØDE NT-USB Mini. Despite the effort required, we decided to re-record all of them because (some of) the original recordings have quite a bit of noise.

We made good progress with this task. We processed and uploaded the 2,700 recordings from January and recorded a further 1,200 words.

This means that about 75% of the original recordings have been replaced now, leaving around 2,150 to be re-recorded.

Adding words from other sources

We recorded around 1,800 new words.

In total, there are over 16,500 audio recordings now.

Thoughts on quality

In a previous report, I presented some figures on the minutes per 100 words it takes to do these recordings. I realized that these figures leave out an important detail, and therefore underestimate the real effort required: the number of bad recordings that have to be thrown away.

Recordings can be bad for a number of reasons:

  • Background noise: e.g. car horns
  • Accidental noises: hitting the table or making noises with the mouth
  • Mispronunciations: speaking a similar but different word, leaving out certain letters, etc

Such recordings have to be done again, which increases the overall time. Unfortunately, I didn't keep exact records on the reasons for throwing away recordings. However, I should be able to get a rough count on the number of bad words for each session. I'll try to look into this in the future.
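To illustrate how discarded recordings change the earlier figures, here is a rough sketch that computes the minutes per 100 usable words. The session time and word count come from the first batch in the November table; the discard count is a made-up placeholder, since no exact records were kept, and the function name is hypothetical.

```python
def minutes_per_100_usable_words(total_minutes, words_recorded, words_discarded):
    """Minutes spent per 100 recordings that survive QA."""
    usable = words_recorded - words_discarded
    if usable <= 0:
        raise ValueError("no usable recordings in this session")
    return total_minutes / usable * 100


# First batch from the earlier table: 71 minutes for 311 words.
# The discard count of 30 is a placeholder, not a measured value.
naive = minutes_per_100_usable_words(71, 311, 0)
adjusted = minutes_per_100_usable_words(71, 311, 30)
print(f"ignoring discards: {naive:.1f}, counting discards: {adjusted:.1f}")
```

This only counts the time already spent. Re-recording the discarded words takes additional time that is not included here.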

While we listen to all recordings twice, I'm sure some mispronunciations do not get caught and are uploaded. It might be a good idea to recruit volunteers to listen to the recordings and report issues. This makes me wonder more generally: does Wikimedia Commons have a clear process for reporting bad recordings? I don't think so. I have to add this to the list of topics I want to bring up with the Wikimedia Commons community.

Finally, I feel a bit frustrated right now. I live in a very noisy environment and this makes recording difficult. While the quality of the current recordings is okay (certainly better than the ones with the old microphone), they are far from professional quality. Maybe the Wikimedia community should actively reach out to professional voice over artists to contribute recordings.

Accomplishments

  • Processed 2,700 recordings from January
  • Recorded 1,800 new words (including many words not currently found on Wiktionary)
  • Re-recorded 1,200 words to replace the original recordings

Next steps

  • Finish the re-recording of the original recordings
  • Document my audio recording process in more detail

Other languages

While there is more work to be done, I feel we're reaching diminishing returns here. We have covered >95% of Swahili lemmas on Wiktionary as well as plural nouns and verb infinitives. I started this effort to have audio for language learning flash cards and this goal has been achieved.

Since I live in the Philippines, there are many other languages that I could focus on, most notably Cebuano (Bisayan). One problem with recording Cebuano is that some words have several pronunciations that vary by stress. For example, the word "puso" has different meanings depending on the stress. Unfortunately, Wiktionary doesn't indicate the stress for most Cebuano words; I feel like Wiktionary needs more work before recordings are practical. I reached out to a Filipino Wiktionary editor to discuss this problem but did not hear back.

While Cebuano recordings would be nice, my current interest is in Tok Pisin, an English-based creole which serves as the lingua franca of Papua New Guinea (PNG). There are quite a few students from PNG in the Philippines, so I might be able to find a volunteer for this.