The Corpus

Colonia is a historical corpus of Portuguese containing texts from the 16th to the 20th century. It contains over 5.1 million tokens divided into five sub-corpora.

Data

The corpus contains complete Portuguese manuscripts published from 1500 to 1936, divided into five sub-corpora, one per century, as shown below. The corpus was POS-tagged using TreeTagger.

Century    Texts    Tokens
16th          13      399,245
17th          18      709,646
18th          14      425,624
19th          38    2,490,771
20th          17    1,132,696
Total        100    5,157,982
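
The totals in the table can be checked, and each century's share of the corpus computed, with a few lines of Python. This is only a minimal sketch: the figures are copied from the table above, while the names COLONIA_COUNTS, totals and token_shares are illustrative and not part of any corpus tooling.

```python
# Per-century counts copied from the table above: (texts, tokens).
# The dictionary and function names are illustrative, not part of the corpus.
COLONIA_COUNTS = {
    "16th": (13, 399_245),
    "17th": (18, 709_646),
    "18th": (14, 425_624),
    "19th": (38, 2_490_771),
    "20th": (17, 1_132_696),
}


def totals(counts):
    """Return (total number of texts, total number of tokens)."""
    total_texts = sum(texts for texts, _ in counts.values())
    total_tokens = sum(tokens for _, tokens in counts.values())
    return total_texts, total_tokens


def token_shares(counts):
    """Return each century's fraction of all tokens in the corpus."""
    _, total_tokens = totals(counts)
    return {century: tokens / total_tokens
            for century, (_, tokens) in counts.items()}


total_texts, total_tokens = totals(COLONIA_COUNTS)
shares = token_shares(COLONIA_COUNTS)

for century, share in shares.items():
    print(f"{century}: {share:.1%} of tokens")
```

The sub-corpora differ considerably in size, with the 19th century the largest by a wide margin, so frequency comparisons across centuries are probably best made on normalized counts (for example, per million tokens) rather than raw counts.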

The texts are balanced by language variety: 48 are European Portuguese and 52 are Brazilian Portuguese. You can find more information in the paper that describes the corpus. The complete inventory of texts is here and more detail regarding annotation can be found here.

Accessing the Corpus

The corpus can be downloaded with POS annotation or accessed via the CQP query interface.
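
Once downloaded, the POS-annotated files can be processed with a short script. The sketch below assumes a one-token-per-line layout with three tab-separated columns (token, POS tag, lemma), which is a common TreeTagger output format. The actual layout of the Colonia files may differ, and both the function names and the tags in the sample are hypothetical, so check the annotation details before relying on it.

```python
from collections import Counter


def parse_tagged_lines(lines):
    """Yield (token, tag, lemma) tuples from tab-separated lines.

    ASSUMPTION: each token line has exactly three tab-separated fields.
    Blank lines and any other lines (e.g. markup) are skipped.
    """
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue
        yield tuple(parts)


def tag_frequencies(lines):
    """Count how often each POS tag occurs."""
    return Counter(tag for _, tag, _ in parse_tagged_lines(lines))


# Hypothetical sample; the tags are placeholders, not the corpus tagset.
sample = [
    "<text>\n",
    "O\tDET\to\n",
    "rei\tNOM\trei\n",
    "chegou\tV\tchegar\n",
    "\n",
    "O\tDET\to\n",
    "</text>\n",
]

freqs = tag_frequencies(sample)
print(freqs.most_common())
```

To run this on a real file, pass an open file object instead of the sample list, for example `tag_frequencies(open(path, encoding="utf-8"))`, where `path` points to a downloaded file.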

Citing the Corpus

Zampieri, M. and Becker, M. (2013) Colonia: Corpus of Historical Portuguese. In: ZSM Studien, Special Volume on Non-Standard Data Sources in Corpus-Based Research. Volume 5. Shaker. [pdf]

© 2013 Marcos Zampieri