DSL Shared Task (Finished)

Discriminating between similar languages and language varieties is one of the bottlenecks of language identification. This problem has been the topic of a number of papers published in recent years. The Discriminating between Similar Languages (DSL) shared task aims to provide a dataset for evaluating systems' performance in discriminating between 13 different languages in 6 groups of languages.


We will provide training and test sets covering 13 different languages organized in 6 groups (A to F). Each of the 6 groups contains closely related languages or language varieties (see below).

We will first provide a set of 20,000 instances per language (18,000 training + 2,000 development) in CSV format. Each instance is a full sentence extracted from journalistic corpora, written in one of the languages, and tagged with the language group and country of origin. After one month we will release a test set containing 1,000 unidentified instances of each language to be classified according to the country of origin. We list below the aforementioned 13 languages and 6 groups involved:

Participants should return their results within 2 days of the release of the test set, and scores will be calculated according to the systems' accuracy in identifying the country of origin. We allow two kinds of submissions (please indicate this when you fill in your registration form):

The authors of the best systems will be invited to submit a paper describing their findings (8 pages + 2 for references). The template can be found here.
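The accuracy-based scoring described above can be illustrated with a minimal sketch. This is not the organizers' official scorer: the function names, the placeholder labels such as "A-1", and the idea of reporting a per-group breakdown are all assumptions made for illustration.

```python
from collections import defaultdict


def accuracy(gold, predicted):
    """Fraction of instances whose predicted label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot score an empty set of instances")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def accuracy_by_group(gold, predicted, groups):
    """Accuracy computed separately for each language group (hypothetical breakdown)."""
    if not (len(gold) == len(predicted) == len(groups)):
        raise ValueError("all inputs must have the same length")
    totals = defaultdict(int)
    hits = defaultdict(int)
    for g, p, grp in zip(gold, predicted, groups):
        totals[grp] += 1
        hits[grp] += int(g == p)
    return {grp: hits[grp] / totals[grp] for grp in totals}


# Placeholder labels: "A-1" stands for the first variety in group A, and so on.
gold = ["A-1", "A-2", "B-1", "B-1"]
predicted = ["A-1", "A-1", "B-1", "B-1"]
groups = ["A", "A", "B", "B"]

overall = accuracy(gold, predicted)
per_group = accuracy_by_group(gold, predicted, groups)
```

In this toy example three of the four instances are labelled correctly, and the only error is a confusion between two varieties within group A.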


The DSL shared task will run following the schedule below:


We are pleased to report that 22 teams participated in the DSL shared task and 8 of them submitted their results. The shared task results can be found at this link. The organizers would like to thank all 22 groups for their participation.

If you have any questions, please send an e-mail to: dsl.sharedtask@gmail.com


Important Dates