VOICE: Vienna-Oxford International Corpus of English

Title

VOICE: Vienna-Oxford International Corpus of English

Author

Barbara Seidlhofer; Angelika Breiteneder; Theresa Klimpfinger; Stefan Majewski; Ruth Osimk-Teasdale (POS-tagged versions); Marie-Luise Pitzl; Michael Radeka (POS-tagged versions)

Availability

Available under the conditions of the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0) License (http://creativecommons.org/licenses/by-nc-sa/3.0/). This is a very liberal license that grants certain rights for non-commercial use, especially your right to use VOICE for your own research, but also reserves certain rights for the original creators of VOICE. Please refer to the file COPYING for the details of this license and the extent of the rights granted.

Download: zip

Languages

English

Editorial Practice

Encoding format: TEI P5 XML

OTA keywords

Linguistic corpora
Corpus

LC keywords

Linguistics
Linguistic analysis (Linguistics)

Extent
  • designation: CollectionText
  • size: 1695 files: ca. 353.9 MB

Creation Date

The corpus data were collected between 2001 and 2007, and compiled from 2005 to 2009. VOICE Online was published in 2009, and VOICE XML in 2011. VOICE POS XML and VOICE POS Online were released in 2013.

Source Description

VOICE is based on audio recordings of 151 naturally occurring, non-scripted, face-to-face interactions involving 753 identified individuals from 49 different first-language backgrounds using English as a lingua franca (ELF), i.e. English used as a common means of communication among speakers from different first-language backgrounds. The recordings were carried out between July 2001 and November 2007, usually using portable mini-disc recorders with external microphones. Most of the audio recordings are supplemented by detailed field notes including information about the nature of the speech event and the interaction taking place as well as about the participants engaging in these ELF interactions. The interactions recorded are complete speech events from different domains (educational, leisure, professional) and of different speech event types (conversation, interview, meeting, panel, press conference, question-answer session, seminar discussion, service encounter, working group discussion, workshop discussion). The audio recordings were transcribed, checked and proof-read by trained transcribers and researchers in accordance with the VOICE mark-up and spelling conventions [2.1] (see http://www.univie.ac.at/voice/page/transcription_general_information). Details for each electronic text are given in the individual text headers.

This package includes the XML data files and additional material of the Vienna-Oxford International Corpus of English (VOICE). It comprises four data sets.

  • The first data set, VOICE 1.0 XML, corresponds to the data that was published via the web platform VOICE Online http://www.univie.ac.at/voice/ on 2009-05-22.
  • The second data set, VOICE 1.1 XML, corresponds to the data that was published via the web platform VOICE Online http://www.univie.ac.at/voice/ on 2011-05-05. It includes minor revisions and corrections that were made between 2011-01-24 and 2011-04-22.
  • The third data set, VOICE 2.0 XML, corresponds to the data that was published via the web platform VOICE Online http://www.univie.ac.at/voice/ on 2013-01-22. It includes minor revisions and corrections that were made between 2012-07-02 and 2012-07-31.
  • The fourth data set, VOICE POS XML 2.0, is the first part-of-speech tagged and lemmatized version of the corpus. It corresponds to the data on VOICE POS Online 2.0, which was published on 2013-01-22 and is accessible via the web platform VOICE Online http://www.univie.ac.at/voice/. It also corresponds to the data of VOICE 2.0 XML (see above), but with a number of differences in the encoding scheme (see section 4).
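
As a minimal sketch of how files in these data sets might be read, the Python snippet below counts utterances per speaker in a TEI P5 document. It assumes that utterances are encoded with the standard TEI element u carrying a who attribute; the actual VOICE encoding scheme may differ, so check the text headers and the documentation in the package first. The sample document, the speaker IDs and the function name count_utterances_by_speaker are hypothetical and used for illustration only.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace used by TEI P5 documents.
TEI_NS = "http://www.tei-c.org/ns/1.0"


def count_utterances_by_speaker(xml_text):
    """Count TEI <u> (utterance) elements per value of the 'who' attribute.

    Assumption: utterances are encoded as <u who="...">; this is standard
    TEI practice for transcribed speech, not a confirmed detail of VOICE.
    """
    root = ET.fromstring(xml_text)
    counts = Counter()
    for utterance in root.iter(f"{{{TEI_NS}}}u"):
        counts[utterance.get("who", "unknown")] += 1
    return counts


# Hypothetical miniature document, not taken from the corpus.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <u who="#S1">hello</u>
    <u who="#S2">hi</u>
    <u who="#S1">how are you</u>
  </body></text>
</TEI>"""

counts = count_utterances_by_speaker(SAMPLE)
for speaker, n in sorted(counts.items()):
    print(speaker, n)
```

To apply the same function to a file from the package, read the file contents as text and pass them in; for large files, ET.iterparse may be a better fit than loading the whole document at once.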

Notes

The download now also includes an updated version of VOICE XML (VOICE 2.0 XML) and a part-of-speech tagged and lemmatized version of VOICE (VOICE POS XML).

The primary language of the corpus is English as a lingua franca, with some switches to other languages. The corpus consists of manual transcriptions of audio recordings of speech.