Qatari Arabic Corpus

Afra Al-Qahtani, Noha Selim, Heba Al-Kababji, Mohamed Elmahdy, Mark Hasegawa-Johnson, and Eiman Mustafawi

The Qatari Arabic Corpus is now available for download at

Content
Speech was recorded from four Qatari television programs in 2009-2011:
  • Al-Jazeera interviews: 207 minutes (multi-dialect, recorded in Qatar; relatively formal)
  • LAKOM: 240 minutes (Moroccan dialect; not yet transcribed)
  • Sabah El-Doha: 110 minutes (multi-dialect, recorded in Qatar; relatively informal)
  • Tesaneef: 550 minutes (Qatari dialect; extremely informal)

Distributed Files
  • Nineteen hours of monaural broadcast speech audio,
    • 16 bits/sample in WAV format,
    • recorded at 44.1 kHz sampling rate, but
    • downsampled to 16 kHz sampling rate for distribution.
  • Fifteen hours of phonetic transcription
    • Arabic script,
    • fully vowelized,
    • extended with Persian and Urdu characters in order to distinguish phonemes that are not part of the core Arabic orthography.
  • Fifteen hours of English gloss.
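
The figures listed above can be cross-checked with a little arithmetic. The following is a minimal sketch, not part of the corpus distribution: the names `PROGRAM_MINUTES`, `UNTRANSCRIBED`, `total_minutes`, and `pcm_bytes` are hypothetical, and it assumes that every program except LAKOM is fully transcribed and that the audio is stored as uncompressed PCM (WAV header overhead is ignored).

```python
from fractions import Fraction

# Program durations in minutes, copied from the list above.
PROGRAM_MINUTES = {
    "Al-Jazeera interviews": 207,
    "LAKOM": 240,
    "Sabah El-Doha": 110,
    "Tesaneef": 550,
}
# Assumption: LAKOM is the only program without a transcription.
UNTRANSCRIBED = frozenset({"LAKOM"})

ORIGINAL_RATE_HZ = 44_100   # recording sampling rate
SAMPLE_RATE_HZ = 16_000     # distribution sampling rate
BYTES_PER_SAMPLE = 2        # 16 bits/sample
CHANNELS = 1                # monaural


def total_minutes(exclude=frozenset()):
    """Sum the program durations, skipping any program named in `exclude`."""
    return sum(m for name, m in PROGRAM_MINUTES.items() if name not in exclude)


def pcm_bytes(minutes):
    """Size of the raw PCM payload at the distribution format, in bytes."""
    return minutes * 60 * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS


all_minutes = total_minutes()
transcribed_minutes = total_minutes(UNTRANSCRIBED)

# Downsampling ratio in lowest terms (output samples per input sample).
ratio = Fraction(SAMPLE_RATE_HZ, ORIGINAL_RATE_HZ)

print(f"audio:       {all_minutes} min = {all_minutes / 60:.2f} h")
print(f"transcribed: {transcribed_minutes} min = {transcribed_minutes / 60:.2f} h")
print(f"resampling ratio: {ratio.numerator}/{ratio.denominator}")
print(f"approx. PCM size: {pcm_bytes(all_minutes) / 1e9:.2f} GB")
```

Under these assumptions the four programs add up to 1107 minutes (about 18.5 hours) of audio, of which 867 minutes (about 14.5 hours) are transcribed. This is in line with the rounded figures of nineteen and fifteen hours given above. The 44.1 kHz to 16 kHz conversion corresponds to a rational resampling factor of 160/441.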