Creating a speech corpus #1: Before you begin

Before you start collecting data, you need to do some due diligence. As important as speech data sets are, they are not trivial to create, and you need to balance what you want from the data against the time and resources you can access. I don’t mean to suggest that developing speech data sets isn’t important (it is), but rather that it needs to happen after careful consideration. This post gets at some of the things you’ll want to think about before you start planning your dream corpus. ...

20 Jul 2021 · 6 min · Khia A. Johnson

Creating a speech corpus: A new blog series

So you want to create a new speech dataset? There are a lot of things to consider at every stage of the process. This is the first (introductory) post in a series I’m starting on the topic, based on my experience developing the SpiCE corpus of Speech in Cantonese and English. There are undoubtedly things I could have done better, but in any case, I certainly learned a lot about speech data along the way. ...

29 Jun 2021 · 2 min · Khia A. Johnson

The SpiCE corpus is officially out!

If you follow my work, you probably know that I’ve spent the last few years working on a new open-access Cantonese-English bilingual speech corpus. Well, today marks the first official release of the data into the wild. I’m thrilled to share it with all of you, and 100% expect to see lots of great work with the corpus. There are many places you can learn more about it, starting here. ...

20 May 2021 · 2 min · Khia A. Johnson

Sibilant trajectories with Python + praat-parselmouth

Once I’ve identified a sample of speech sounds that I want to analyze, the next step is to do that analysis. There are obviously many ways to go about this process. Here, I’ll walk through an example of measuring sibilant trajectories with the fantastic praat-parselmouth Python package. It’s my current favorite technique for avoiding Praat scripting. ...
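To make the idea concrete: a sibilant trajectory here means an acoustic measure, such as spectral center of gravity, taken in successive analysis windows across a sound. The post itself uses the praat-parselmouth package for this; the dependency-free Python sketch below (all names hypothetical) only illustrates the windowed center-of-gravity idea with a naive DFT, and is not the parselmouth API.

```python
import cmath
import math

def spectral_cog(frame, sample_rate):
    """Spectral center of gravity (Hz) of one frame, via a naive DFT."""
    n = len(frame)
    num = den = 0.0
    for k in range(1, n // 2):  # positive frequencies only
        coef = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                   for t in range(n))
        power = abs(coef) ** 2
        num += (k * sample_rate / n) * power
        den += power
    return num / den if den else 0.0

def cog_trajectory(samples, sample_rate, frame_len=256, hop=128):
    """Center of gravity measured in successive overlapping frames."""
    return [spectral_cog(samples[i:i + frame_len], sample_rate)
            for i in range(0, len(samples) - frame_len + 1, hop)]

# Synthetic stand-in for a sibilant: a sine whose frequency rises over
# time, so the measured trajectory should rise too.
sr = 8000
samples = [math.sin(2 * math.pi * (1000 + 2000 * t / sr) * t / sr)
           for t in range(sr // 2)]
traj = cog_trajectory(samples, sr)
```

With real speech you would instead slice out each annotated sibilant interval and take the same windowed measurements there.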

3 May 2021 · 5 min · Khia A. Johnson

You should use tidylog in your #rstats corpus phonetics workflow

Last week, I asked #rstats twitter for a bit of help with something that has always felt clunky in my R code but was never annoying enough to actually fix. In corpus phonetics, you typically start with a large data set, make measurements, and then use informed criteria to filter out errors to the best of your ability, because measurements can be wrong. When you go to share your findings, you need to report how many items were removed (and why). To do this, you have to keep track. Sure, alternating between filter() and print(nrow(df)) works, but it’s clunky. I’m starting to think that maybe I should have been annoyed earlier. ...
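The tidylog package wraps dplyr verbs like filter() so that every call automatically reports how many rows it removed. As a language-neutral sketch of that bookkeeping (all names here are hypothetical, and this is not the tidylog API), the same idea in plain Python looks like:

```python
def logged_filter(rows, predicate, label):
    """Filter a list of records, reporting how many were removed --
    the bookkeeping that tidylog's filter() automates in R."""
    kept = [r for r in rows if predicate(r)]
    removed = len(rows) - len(kept)
    print(f"{label}: removed {removed} of {len(rows)} rows "
          f"({len(kept)} remain)")
    return kept

# Toy example: duration measurements, dropping an implausible value.
measurements = [
    {"word": "spice", "duration_ms": 142},
    {"word": "speech", "duration_ms": 3},   # likely a measurement error
    {"word": "corpus", "duration_ms": 210},
]
clean = logged_filter(measurements,
                      lambda r: r["duration_ms"] > 20,
                      "duration check")
```

Each exclusion criterion becomes one logged step, so the counts you need to report are printed as a side effect of the filtering itself.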

29 Mar 2021 · 5 min · Khia A. Johnson