Mark Dingemanse & Andreas Liesenfeld 2022-03-20
Abstract: Informal social interaction is the primordial home of human language. Linguistically diverse conversational corpora are an important and largely untapped resource for computational linguistics and language technology. Through the efforts of a worldwide language documentation movement, such corpora are increasingly becoming available. We show how interactional data from 63 languages (26 families) harbours insights about turn-taking, timing, sequential structure and social action, with implications for language technology, natural language understanding, and the design of conversational interfaces. Harnessing linguistically diverse conversational corpora will provide the empirical foundations for flexible, localizable, humane language technologies of the future.
This document produces the figures and analyses reported in the ACL2022 paper “From text to talk: Harnessing conversational corpora for humane and diversity-aware language technology”.
The paper is based on a curated collection of conversational corpora that are individually made available for research purposes. The full set of 0 corpora for 73 of 29 is documented in a separate code notebook, along with reports on the most important features of each corpus.
Not all of the languages or corpora are represented in all analyses and figures presented in the paper, because the corpora differ in size, precision of annotation, level of transcription. Accounts of inclusions and exclusions are provided in the reports by language.
Note on data availability. The corpora we rely on are made available for research purposes, but come under a variety of usage restrictions which in many cases (and for sensible reasons) prevent redistribution. We try to provide all relevant details about the corpora and how to access them in the following ways:
- All corpora are cited in the paper with at least the name of the compilers and a durable HANDLE or DOI locator.
- Links to all corpora and a considerable amount of additional data are also provided in the separate reports by language.
- Our data curation workflow is detailed in this preprint.
This also means that the report below is based on the full data but we cannot share all of this data in the repository. Instead, we can share only derived measures and samples. We have tried to make our analysis as perspicuous as possible by including the code used to generate the measures and samples in the Rmd code.
The subset of corpora considered here amounts to around 800 hours of speech, or 9.3 million words, segmented into 1.6 million annotations produced by over 11.000 participants.
## # A tibble: 63 x 6
## language turns words minutes hours people
## <chr> <int> <dbl> <dbl> <dbl> <int>
## 1 +Akhoe 721 3253 28.7 0.478 18
## 2 Akpes 635 3965 17.8 0.296 4
## 3 Ambel 1509 6601 42.0 0.700 22
## 4 Anal 6767 26826 278. 4.63 54
## 5 Arabic 33120 201207 1211. 20.2 365
## 6 Arapaho 4821 55850 243. 4.06 109
## 7 Baa 1361 12553 65.3 1.09 7
## 8 Br. Portuguese 3242 17109 88.4 1.47 2
## 9 Catalan 11059 93827 398. 6.64 83
## 10 Chitkuli 1123 13462 68.3 1.14 20
## # ... with 53 more rows
We generate a map with numbers for labels (used in the paper).
We also produce a graph of language resources by size.
We look at the timing of turn-taking in sufficiently large corpora, limiting the analysis to dyadic interactions because triadic and multi-party interaction is qualitatively different. We determine participation framework based on a 10 second rolling window: if in the window there are no more than 2 participants we count the interaction at that point as dyadic.
Five corpora are excluded from these analyses because they have unclear or incommensurable segmentation conventions: Brazilian Portuguese, Croatian, Czech, Hungarian, and Nganasan. A further number of corpora do not feature at least 1000 dyadic turn transitions.
There are 24 languages of 1 language families in which the corpora provide >1000 turn transitions in dyadic settings. The median floor transfer onset time (FTO) across the sample is 10 ms, very close to the no-gap no-overlap goal seen in prior work.
As a language-agnostic approach to activity types, we consider a distinction between ‘chats’ and ‘chunks’. We identify a number of these and plot them in six unrelated languages. Chunks are identified by looking for streaks of >2 identical turns by the same participant occurring in close succession while they make up less than 30% of talk in the surrounding 10 second window.
Chats are identified by looking for turns with a duration of 3 seconds that occur in a regime where contributions in the 10 second window surrounding them are evenly distributed (both speakers contribution between 40% to 60% of talk).
We find that the most frequent turn formats used in English and Korean tellings are ‘mhm’ and ‘eung’ respectively. We plot 4 examples of 80 second stretches featuring continuers, and sample 100 such segments for each of the languages to compare the relative frequency of continuers.
We sample 100 stretches of 80000 ms (80 s) in each of the languages.
The sample is randomized each time, but with the seed used in the paper, we end up with 100 samples from 53 Korean source recordings with 106 distinct participants, and another 100 samples from 55 English source recordings with 96 distinct participants.
Here are the basic descriptive statistics for number of total turns, mean number of continuers, and overlap.
## # A tibble: 2 x 9
## langshort n_turns mean_cont n_cont_sd mean_rel n_rel_sd mean_inv n_inv_sd
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 English 30.5 2.56 1.71 0.0920 0.0859 14.5 12.8
## 2 Korean 35.7 6.98 4.22 0.201 0.103 6.99 4.95
## # ... with 1 more variable: mean_overlap <dbl>
Note that average duration and number of turns per 10 second window is not very different across the two languages, so won’t really explain this:
## # A tibble: 2 x 4
## language meanduration turnsper10sec wordsperturn
## <chr> <dbl> <dbl> <dbl>
## 1 english 1871. 5.64 6.85
## 2 korean 2545. 5.55 5.74
Putting it all together, we get at the following figure:
The number of truly unique turns across the whole dataset is 1096474. However, the total number of turns is 1532886, so at least a quarter across all languages (436412 out of 1532886) occurs more than once. Of these, one fifth (329634) occur more than 20 times.
For analysing the relation between frequency and rank at the level of turn formats we use only corpora in which there are at least 20 recurring turn formats. There are 22 such languages (representing 8 families). We also look at a further set of 17 smaller corpora (representing 21 languages of 18 families) in which there at least 9 recurring turn formats.
What is the proportion of multiword versus one word recurrent turn formats?
## # A tibble: 2 x 3
## words n prop
## <chr> <int> <dbl>
## 1 more 225 0.4
## 2 one 370 0.6
Stivers et al. 2009 work with polar questions only. We can try to approach that by looking at all FTOs of turn types that are reasonably frequent (= likely to be conventionalized responses to polar questions) that follow turns that end in a question mark.
We limit ourselves to languages in which 250 such sequences are found. Here are, for each of these 10 languages, one example of a candidate QA sequence so identified:
language | Q | A |
---|---|---|
Catalan | a quina hora li he dit que arribava? | sí |
Dutch | maar een zomerjurk? | ja |
English | thats going to your house right? | yeah |
German | und äh wir waren jetzt ja äh äh soll ich dir mal alles der Reihe nach erzählen? | ja |
Kerinci | lebih baik itòh bé jadi maharnyo kan? | iyò |
Korean | ja-gi-do geu teum-e ggi-eo-seo gan geo-ya? | eung |
Mandarin | hao de hao de hao de hao jiu zhege shiqing shiba? | e |
Polish | a czy znala jezyk litewski? | mhm |
Sambas | itoq lah anak béduwaq i? | eeq |
Spanish | oye y la estela cómo está? | sí |
Then we print the number of candidate QA sequences by language and plot the distribution of FTOs (floor transfer onset times).
## # A tibble: 10 x 2
## language n
## <chr> <int>
## 1 Catalan 930
## 2 Dutch 557
## 3 English 1230
## 4 German 1540
## 5 Kerinci 682
## 6 Korean 4294
## 7 Mandarin 2698
## 8 Polish 685
## 9 Sambas 501
## 10 Spanish 2363
While the body of the paper provides a plot for 22 large corpora, here we als plot the rank-frequency distributions for turns and words for 21 smaller corpora in which there are at least 9 recurrent turn formats.
This table shows language name, family, glottocode and citation key. The paper contains full bibliographic metadata.
## # A tibble: 63 x 4
## Language Family glottocode Citation
## <chr> <chr> <chr> <chr>
## 1 +Akhoe Khoe-Kwadi haio1238 widlokCollectionAkhoeHai2007
## 2 Akpes Atlantic-Congo akpe1248 lauDocumentingAbesabesi2019
## 3 Ambel Austronesian waig1244 arnoldDocumentationAmbelAustronesia~
## 4 Anal Sino-Tibetan anal1239 ozerovCommunitydrivenDocumentationN~
## 5 Arabic Afro-Asiatic egyp1253 canavanalexandraCALLHOMEEgyptianAra~
## 6 Arapaho Algic arap1274 cowellConversationalDatabaseArapaho~
## 7 Baa Atlantic-Congo kwaa1262 mollernwadigoDocumentationProjectBa~
## 8 Br. Portuguese Indo-European braz1246 dasilvaProjetoNormaUrbana1996
## 9 Catalan Indo-European stan1289 garridoGlissandoCorpusMultidiscipli~
## 10 Chitkuli Sino-Tibetan chit1279 martinezDocumentaryCorpusChhitkulRa~
## # ... with 53 more rows