The Corpus

Colonia is a historical Portuguese corpus of texts spanning the 16th to the 20th century. It contains over 5.1 million tokens divided into five sub-corpora.


The corpus contains complete Portuguese manuscripts published from 1500 to 1936, divided into five sub-corpora, one per century, as shown below. The corpus was POS-tagged using TreeTagger.


The texts are balanced by language variety: 48 are European Portuguese and 52 are Brazilian Portuguese. You can find more information in the paper that describes the corpus. The complete inventory of texts is here and more detail regarding annotation can be found here.

Accessing the Corpus

The corpus can be downloaded with POS annotation or accessed via a CQP query interface.
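
For readers who download the POS-annotated version, the following is a minimal Python sketch of how such files might be read. It assumes the files follow TreeTagger's usual one-token-per-line layout (word, tag, lemma, separated by tabs); the actual layout of the Colonia files should be checked against the annotation documentation. The names `Token`, `parse_tagged_lines`, and `tag_frequencies`, the sample lines, and the tag labels are all hypothetical and serve only as illustration.

```python
from collections import Counter
from typing import Iterable, Iterator, NamedTuple


class Token(NamedTuple):
    word: str
    pos: str
    lemma: str


def parse_tagged_lines(lines: Iterable[str]) -> Iterator[Token]:
    """Yield tokens from lines in the assumed layout: word<TAB>tag<TAB>lemma."""
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            # Skip blank lines, markup, or anything not in the assumed layout.
            continue
        yield Token(*parts)


def tag_frequencies(tokens: Iterable[Token]) -> Counter:
    """Count how often each POS tag occurs."""
    return Counter(tok.pos for tok in tokens)


# Made-up sample lines; the tag labels are placeholders, not the corpus tagset.
sample = [
    "a\tDET\to\n",
    "casa\tNOM\tcasa\n",
    "<s>\n",
    "era\tV\tser\n",
    "grande\tADJ\tgrande\n",
]

tokens = list(parse_tagged_lines(sample))
freqs = tag_frequencies(tokens)
```

To process a downloaded file instead of the sample, an open file handle can be passed to `parse_tagged_lines` in place of the list, since the function accepts any iterable of lines. The file encoding is not stated here and would need to be confirmed.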

Citing the Corpus

Zampieri, M. and Becker, M. (2013) Colonia: Corpus of Historical Portuguese. In: ZSM Studien, Special Volume on Non-Standard Data Sources in Corpus-Based Research. Volume 5. Shaker. [pdf]


© 2013 Marcos Zampieri