A blog about language data analysis and communicating science. Why write it? I think it’s important to make research on language accessible, and want to get better at it… so, practice? I’ll be thrilled if you like my writing and learn something too. I’m also thinking about writing the tutorials I wished existed when I first started learning, as a way to share the expertise I’ve gained and give back to the community.

About me

I’m a PhD candidate in Linguistics at the University of British Columbia. I study phonetic and phonological variation in bilingual speech, and think about psycholinguistic questions related to processing, representation, and language contact. I use corpus and experimental methods to look at these questions in speech production and perception. My dissertation examines how bilinguals produce similar (or different) voice quality and speech sounds in each of their languages. Essentially: how similar are you to yourself? For this research, I’ve developed a corpus of conversational bilingual speech in Cantonese and English, with lots of support from undergraduate research assistants in the Speech-in-Context Lab and initial funding from the UBC Public Scholars Initiative.

What I (want to) write about

I have a few writing ideas (“themes”) for this blog, which will probably change over time. In any case, all errors are mine.

  1. scicomm, or communicating science: Writing up my published research for non-specialist audiences
  2. analyses, or analyzing language data: Verbose walkthroughs of the analyses in my papers, posters, and talks. I try to share this kind of thing in my OSF repository, though most of what’s there is supplementary to the papers, and doesn’t stand alone.
  3. tutorials, or corpus phonetics tutorials of all shapes and sizes: These posts could be full-fledged tutorials where I explore a question with an open-access or web-scraped data set. Or they might be a small snippet of code that does something useful.
  4. keysmash, or anything that reads more like an opinion piece or otherwise doesn’t fit into the above categories. Time will tell if I use this option, but lkdjfoijfgid I like the name.

Some of the things I use

  • I use R for stats and data visualization... for the most part. Specifically, RStudio, the tidyverse, brms, and (less now than before) lme4.
  • I also use Python, mostly written in a notebook style in VS Code. I prefer Python to R when working with audio files using pydub and parselmouth, when using NLP tools like nltk or pycantonese, and for web scraping or interfacing with the cloud in some way.
  • For acoustic analyses, the big players are Praat and VoiceSauce, plus ELAN for (some of the) accompanying transcripts. There are also command line tools that support doing corpus phonetics, such as the Montreal Forced Aligner, AutoVOT, and many others.