If you follow my work, you probably know that I’ve spent the last few years working on a new open-access Cantonese-English bilingual speech corpus. Well, today marks the first official release of the data into the wild. I’m thrilled to share it with all of you, and fully expect to see lots of great work done with the corpus. There are many places you can learn more about it, starting here.
Once I’ve identified a sample of speech sounds that I want to analyze, the next step is the analysis itself. There are obviously many ways to go about this process. Here, I’ll walk through an example of measuring sibilant trajectories with the fantastic praat-parselmouth Python package. It’s my current favorite technique for avoiding Praat scripting.
Last week, I asked #rstats Twitter for a bit of help with something that has always felt clunky in my R code but was never annoying enough to actually fix. In corpus phonetics, you typically start with a large data set, make measurements, and then use informed criteria to filter out errors as best you can, because measurements can be wrong. When you go to share your findings, you need to report how many items were removed (and why). To do this, you have to keep track. Sure, sprinkling print(nrow(df)) between filtering steps works, but it’s clunky. I’m starting to think that maybe I should have been annoyed earlier.
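The underlying idea is language-agnostic, so here is a hypothetical Python sketch of the same bookkeeping (the helper name, the toy measurements, and the thresholds are all made up for illustration): each filtering step records how many items it removed, so the exclusion counts you need to report fall out of the pipeline itself instead of being reconstructed from scattered print calls.

```python
# Each filter logs (reason, number removed) as it runs.
def tracked_filter(rows, predicate, reason, log):
    kept = [r for r in rows if predicate(r)]
    log.append((reason, len(rows) - len(kept)))
    return kept

measurements = [
    {"token": "s1", "cog_hz": 7400},
    {"token": "s2", "cog_hz": 120},    # implausibly low: tracking error
    {"token": "s3", "cog_hz": 6900},
    {"token": "s4", "cog_hz": 15500},  # implausibly high
]

log = []
clean = tracked_filter(measurements, lambda r: r["cog_hz"] > 2000, "COG too low", log)
clean = tracked_filter(clean, lambda r: r["cog_hz"] < 12000, "COG too high", log)

print(log)         # [('COG too low', 1), ('COG too high', 1)]
print(len(clean))  # 2
```

At write-up time the exclusion table is just the log, already broken down by reason.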
So you want to work with annotated speech? I find many (not all!) purpose-built tools for corpus phonetics to be slow, buggy, inflexible, or incomplete, while simultaneously promising to do way more than I actually need. Building a SQLite database from
.TextGrid files ended up being a straightforward solution, and it wasn’t hard to do. Here’s a quick tutorial.
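As a taste of the tutorial, here is a minimal sketch using only the standard library. The embedded one-tier TextGrid fragment and the regex are deliberately toy-sized (a real script would read files from disk and handle multiple tiers and point tiers), but the shape of the solution is the same: parse intervals, insert rows, then query with SQL.

```python
import re
import sqlite3

# Toy fragment of a TextGrid interval tier.
TEXTGRID = '''
    intervals [1]:
        xmin = 0
        xmax = 0.5
        text = "s"
    intervals [2]:
        xmin = 0.5
        xmax = 1
        text = "a"
'''

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE intervals (label TEXT, xmin REAL, xmax REAL)")

# Pull out (xmin, xmax, text) triples and load them as rows.
pattern = r'xmin = ([\d.]+)\s*xmax = ([\d.]+)\s*text = "([^"]*)"'
for xmin, xmax, text in re.findall(pattern, TEXTGRID):
    con.execute("INSERT INTO intervals VALUES (?, ?, ?)",
                (text, float(xmin), float(xmax)))

# Querying is now trivial, e.g. every interval labelled "s":
rows = con.execute("SELECT xmin, xmax FROM intervals WHERE label = 's'").fetchall()
print(rows)  # [(0.0, 0.5)]
```

Once the annotations are in SQLite, pulling a sample of segments by label, tier, or duration is a one-line query instead of a pass over every TextGrid file.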
Categorical variables are something I think about a lot in my psycholinguistic research, and they aren’t often given enough time in introductions to mixed effects modeling. I had originally planned to do a write-up on them here, but have now found enough good resources—papers and talks—that I think I’ll just list them with some accompanying comments.
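One point those resources hammer home is that the coding scheme changes what the intercept means. Here is a hypothetical toy illustration with two balanced groups and closed-form simple regression (the data values are made up): under treatment coding the intercept is the reference group’s mean, while under sum coding it is the grand mean, even though the slope (the group difference) is the same.

```python
def ols_two_level(y, x):
    # Closed-form simple regression: returns (intercept, slope).
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) \
        / sum((xi - mx) ** 2 for xi in x)
    return my - b * mx, b

y = [4.0, 6.0, 9.0, 11.0]            # group A mean 5, group B mean 10
treatment = [0, 0, 1, 1]             # A is the reference level
sum_coded = [-0.5, -0.5, 0.5, 0.5]   # centred coding

a_t, b_t = ols_two_level(y, treatment)   # intercept 5.0 (mean of A), slope 5.0
a_s, b_s = ols_two_level(y, sum_coded)   # intercept 7.5 (grand mean), slope 5.0
```

In a mixed model the same logic applies to every fixed effect and, crucially, to how interactions are interpreted, which is exactly why the resources below are worth the read.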
I’m working on a project with a somewhat tricky analysis, and had hit a wall. I didn’t want to just figure the analysis out as I went, for fear of baking researcher bias into the results, but also wasn’t sure if the half-baked analysis would actually answer the question. Enter data simulation, a.k.a. something I should have done a long time ago.
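The core move is simple enough to sketch in a few lines of Python. This is a hypothetical example, not my actual analysis: build data with a known effect size baked in, run the planned analysis on it, and check whether the analysis recovers the truth. If it can’t recover an effect you put there yourself, it can’t answer the real question either.

```python
import random
import statistics

random.seed(42)
TRUE_EFFECT = 30.0  # ms: condition B is slower than condition A by design

def simulate(n_per_condition=200, noise_sd=50.0):
    # Two conditions with a known mean difference plus Gaussian noise.
    a = [random.gauss(500.0, noise_sd) for _ in range(n_per_condition)]
    b = [random.gauss(500.0 + TRUE_EFFECT, noise_sd) for _ in range(n_per_condition)]
    return a, b

a, b = simulate()
estimate = statistics.mean(b) - statistics.mean(a)
# The estimate should land near the 30 ms effect we built in.
print(round(estimate, 1))
```

Repeating the simulate-analyze loop many times also gives you power and bias estimates for free, all before touching the real data, so no researcher bias gets baked in.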
These are short blurbs from the news section of my old website. Not a lot of detail, but there are plenty of links! Some of these items might get a more in-depth treatment as I build out this blog.
A blog about language data analysis and communicating science. Why write it? I think it’s important to make research on language accessible, and want to get better at it… so, practice? I’ll be thrilled if you like my writing and learn something too. I’m also thinking about writing the tutorials I wish had existed when I first started learning, as a way to share the expertise I’ve gained and give back to the community.