README.md (4342B)
[](https://github.com/quadrama/gerdracor-coref/releases/latest)
[](https://zenodo.org/badge/latestdoi/223186468)
[](https://github.com/quadrama/gerdracor-coref/blob/gold/LICENSE)

# GerDraCor-Coref

## General Information

GerDraCor-Coref (German Drama Corpus for Coreference) is a fork of [GerDraCor](https://github.com/dracor-org/gerdracor) and contains coreference annotations for a subset of the GerDraCor texts.
The texts are all German dramatic texts, written between 1730 and 1920.
All noun phrases are annotated; singletons were removed.
Additionally, generic entities, abstract anaphora and ambiguous mentions are marked explicitly.
For the latter two, only part of the corpus has been annotated.

### File Naming

The names of the files are composed of a short form of the title of the play and an appropriate file ending indicating the format, e.g. `Rosenkavalier.xmi`, `Rosenkavalier.xml`, `Rosenkavalier.conll` for "Der Rosenkavalier" by Hugo von Hofmannsthal.
A full list of file names and their corresponding plays is given in `plays.csv`.

### Partial Annotations

Some texts have been annotated only in part, covering one or more acts.
The annotated act(s) are indicated in the filename, e.g. `Manuscript_Act5.xmi`.
If the full text was annotated, no special marker is applied, e.g. `Sara.xmi`.

### Parallel Annotations

To make inter-annotator agreement studies possible, we carried out parallel annotations of single acts, each annotated by distinct annotators.
These annotations are located in separate branches, and the annotator and act are additionally indicated in the filename, e.g. `Sara_AS_Act1`. `gold` annotations are not specially marked in the filename. (ToDo)

### Encoding

All files are encoded in UTF-8 Unicode.
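
The naming conventions from the File Naming, Partial Annotations and Parallel Annotations sections can be sketched in code as follows. This is only an illustration, not part of the corpus tooling: the regular expression, the `PlayFile` type and the `parse_name` function are hypothetical names, and the pattern assumes that short titles contain no underscore and that at most one act is marked per file. How several acts are encoded in a single filename is not specified here, so that case is not covered.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern (not an official specification):
#   <ShortTitle>[_<Annotator>][_Act<N>][.<extension>]
_NAME_RE = re.compile(
    r"^(?P<title>[^_.]+)"                    # short title, e.g. "Sara"
    r"(?:_(?P<annotator>(?!Act\d)[^_.]+))?"  # optional annotator tag, e.g. "AS"
    r"(?:_Act(?P<act>\d+))?"                 # optional act marker, e.g. "Act5"
    r"(?:\.(?P<ext>.+))?$"                   # optional file ending, e.g. "xmi"
)


class PlayFile(NamedTuple):
    """Hypothetical container for the parts of a corpus file name."""
    title: str
    annotator: Optional[str]
    act: Optional[int]
    extension: Optional[str]


def parse_name(filename: str) -> PlayFile:
    """Split a file name into title, annotator, act and file ending."""
    match = _NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    act = match.group("act")
    return PlayFile(
        title=match.group("title"),
        annotator=match.group("annotator"),
        act=int(act) if act is not None else None,
        extension=match.group("ext"),
    )


if __name__ == "__main__":
    # File names taken from the examples in this README.
    for name in ("Rosenkavalier.xmi", "Manuscript_Act5.xmi", "Sara_AS_Act1"):
        print(parse_name(name))
```

A file with no act marker and no annotator tag, such as `Sara.xmi`, is treated as a fully annotated `gold` file under these assumptions.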

## Formats

We provide several formats to represent the coreference annotations:

- XMI
- TEI
- CoNLL 2012
- DIRNDL

For the texts that have not been fully annotated, we additionally provide TEI output containing only the annotated parts.
The CoNLL output always contains only the annotated parts.
The XMI output always contains the full text.

### XMI

As the XMI files can become quite large, they have been compressed using `gzip`.
Decompress them by running the following on the command line:

```sh
$ gzip -d <FILENAME>.xmi.gz
```

### DIRNDL

DIRNDL is a file format based on the CoNLL format that additionally contains a speaker column (among others).

## Organization

The annotations are sorted into folders according to the different output formats.
Parallel annotations by different annotators are organized into branches in the git tree. (ToDo)
The main annotations are located in the `gold` branch.
Partial annotations are sorted under the main folder in a subfolder called `part`.

### Folder structure

```sh
$ tree -d
.
├── conll
│   └── part
├── tei
│   └── part
└── xmi
```

### Branches

```sh
$ git branch
* gold
```

## Citing

If you are using GerDraCor-Coref for a publication, please refer to the following paper:

- Janis Pagel, Nils Reiter. GerDraCor-Coref: A Coreference Corpus for Dramatic Texts in German. In *Proceedings of the Language Resources and Evaluation Conference (LREC)*, pp. 55-64, Marseille, France, May 2020. URL: http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.7.pdf.

```
@inproceedings{gerdracorcoref,
  author = {Janis Pagel and Nils Reiter},
  booktitle = {{Proceedings of the Language Resources and Evaluation Conference (LREC)}},
  location = {Marseille, France},
  month = {5},
  pages = {55--64},
  title = {{GerDraCor-Coref: A Coreference Corpus for Dramatic Texts in German}},
  url = {http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.7.pdf},
  year = {2020},
}
```

## License

Like [GerDraCor](https://github.com/dracor-org/gerdracor), GerDraCor-Coref is released under the [Creative Commons Zero copyright waiver CC0](https://creativecommons.org/share-your-work/public-domain/cc0/).

## Contribution

We appreciate contributions regarding extensions, bug fixes and the like.
Please feel free to create issues or pull requests.