The SpiCE corpus is officially out!

🗓 May 20, 2021 🌼 2-minute read

If you follow my work, you probably know that I’ve spent the last few years working on a new open-access Cantonese-English bilingual speech corpus. Well, today marks the first official release of the data into the wild. I’m thrilled to share it with all of you, and I 100% expect to see lots of great work with the corpus. There are many places you can learn more about it, starting here.

The SpiCE corpus—Speech in Cantonese and English—is an open-access speech corpus available for download under a Creative Commons Attribution 4.0 International License via Scholars Portal Dataverse. This is what our research data management librarians said was best, and you should *always* trust your librarian.

Access the SpiCE Corpus here

A few quick details

SpiCE has…

  • Conversational interviews in 2 languages with 34 early bilingual participants
  • Approximately 23 total hours of transcribed participant speech
  • Force-aligned phone-level annotations
  • Detailed language background summary information
  • A DOI, specifically: 10.5683/SP2/MJOXP3
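
If you’d rather script your way to the corpus landing page (or just want a clean link for a citation), a DOI can be turned into a resolvable URL by prefixing it with the public doi.org resolver. Here’s a minimal Python sketch; the function name `doi_to_url` is my own invention for illustration and isn’t part of any SpiCE tooling.

```python
from urllib.parse import quote

# The public DOI resolver; any registered DOI appended to it redirects to its landing page.
DOI_RESOLVER = "https://doi.org/"


def doi_to_url(doi: str) -> str:
    """Return a resolvable URL for a bare DOI string (hypothetical helper)."""
    # Trim stray whitespace and percent-encode unsafe characters,
    # keeping "/" so the DOI's prefix/suffix structure stays intact.
    return DOI_RESOLVER + quote(doi.strip(), safe="/")


spice_url = doi_to_url("10.5683/SP2/MJOXP3")
print(spice_url)
```

Pasting the printed link into a browser should take you to the dataset’s landing page, where the actual download happens.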

If you want to learn more first, you’ve got options

Read about SpiCE in…

You can also hear me talk about projects I’m working on right now with the corpus at two bilingualism conferences this summer.

  • I’m presenting a poster at the 3rd International Symposium on Bilingual and L2 Processing in Adults and Children (June 3-4, 2021) titled “Uniformity and crosslinguistic influence in Cantonese-English bilingual stops.” Here’s the abstract. FYI, the conference is free to attend!
  • In mid-July, I’ll be at the 13th International Symposium on Bilingualism (ISB13). I’m giving a talk on work with Molly Babel, titled “Language contact within the bilingual: Adapting a model from sound change to crosslinguistic influence” (abstract). The talk will be pre-recorded, and I’ll be sure to share it after the conference.
  • I’ll also be supporting my co-authors Angelina Lloy and Molly Babel at ISB13 in our talk, “Examining the roles of language familiarity and bilingualism in talker recognition” (abstract). This is our first experimental perception work using stimuli drawn from the corpus.

But you do you. That’s the point of making things freely available. I’d love to hear about what you learn from this data!