TIR2015 Genre data

In our paper "Genre classification on German novels" we used a data set of 1682 novels, only some of which carry genre labels. The labeled novels were assigned to one of two subgenres, educational or social, by Lukas Weimer, student assistant at the Chair of Computer Philology. All novels are freely available and were collected from TextGrid, DTA and Gutenberg.

All novels were written in or translated into German, and their dates of origin range from the 16th to the 20th century. Authors include Charles Dickens, Theodor Fontane, Karl May, Sir Walter Scott and Émile Zola. Text lengths range from 4,000 to over one million words, with an average word count of 100,000.

For the TIR15 Genre corpus we divided the novels into five disjoint folders:

labeled educational: 37 novels
labeled social: 63 novels
prototype educational: 21 novels
prototype social: 11 novels
unlabeled: 1550 novels

In our paper we used the two prototype folders (32 novels) as one data set and the prototype and labeled folders together (132 novels) as the second. The entire set of 1682 novels was used for Latent Dirichlet Allocation; the stopword list used for it is provided as well.
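As an illustration of how the two data sets described above could be assembled from the folders, here is a minimal Python sketch. The folder names, the `.txt` file extension, and the helper names `load_folder` and `build_datasets` are assumptions made for this example; the names in the downloaded corpus may differ.

```python
from pathlib import Path

# Hypothetical folder names mapped to their genre label; adjust them to
# whatever the downloaded corpus actually uses.
FOLDERS = {
    "labeled_educational": "educational",
    "labeled_social": "social",
    "prototype_educational": "educational",
    "prototype_social": "social",
}


def load_folder(root, folder):
    """Return (path, genre) pairs for every .txt file in one corpus folder."""
    genre = FOLDERS[folder]
    return [(path, genre) for path in sorted(Path(root, folder).glob("*.txt"))]


def build_datasets(root):
    """Return (prototypes only, prototypes plus labeled) as lists of pairs."""
    prototype = (load_folder(root, "prototype_educational")
                 + load_folder(root, "prototype_social"))
    labeled = (load_folder(root, "labeled_educational")
               + load_folder(root, "labeled_social"))
    return prototype, prototype + labeled
```

With the folder sizes listed above, the first returned list should contain 21 + 11 = 32 novels and the second 32 + 37 + 63 = 132 novels. The unlabeled folder is left out here because it carries no genre label.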

You can download the corpus using this link. If you would like to refer to the TIR15 Genre corpus, please cite:

Genre classification on German novels. Hettinger, Lena; Becker, Martin; Reger, Isabella; Jannidis, Fotis; Hotho, Andreas. In Proceedings of the 12th International Workshop on Text-based Information Retrieval. 2015.
@inproceedings{hettinger2015genre,
  author    = {Hettinger, Lena and Becker, Martin and Reger, Isabella and Jannidis, Fotis and Hotho, Andreas},
  booktitle = {Proceedings of the 12th International Workshop on Text-based Information Retrieval},
  keywords  = {classification},
  title     = {Genre classification on German novels},
  year      = 2015
}
