We’re so excited to share that the 22nd dataset release for Common Voice is now available for download.
Common Voice 22.0 adds 281 hours of speech data, bringing the total to 33,815 hours. It also adds 296 newly validated hours, for a total of 22,640 validated hours of clips. This release welcomes three new languages: Aromanian (rup), Tajik (tg), and Venda/Tshivenda (ve).
Aromanian is spoken by around 210,000 people in the Balkans. Tajik, a language closely related to Persian, is spoken by over 10 million people in Tajikistan and Uzbekistan. Venda/Tshivenda is spoken by over 2 million people as a first or additional language in South Africa and Zimbabwe.
With these additions, the total number of languages available in this Scripted Speech release is 137.