gerdracor-coref

German Drama Corpus for Coreference
git clone git://git.janispagel.de/gerdracor-coref.git

README.md (4342B)


      1 [![release](https://img.shields.io/github/release-pre/quadrama/gerdracor-coref.svg)](https://github.com/quadrama/gerdracor-coref/releases/latest)
      2 [![DOI](https://zenodo.org/badge/223186468.svg)](https://zenodo.org/badge/latestdoi/223186468)
      3 [![license](https://img.shields.io/badge/license-CC0-blue.svg)](https://github.com/quadrama/gerdracor-coref/blob/gold/LICENSE)
      4 
      5 # GerDraCor-Coref
      6 
      7 ## General Information
      8 
      9 GerDraCor-Coref (German Drama Corpus for Coreference) is a fork of [GerDraCor](https://github.com/dracor-org/gerdracor) and contains coreference annotations for a subset of the GerDraCor texts.
     10 All texts are German-language plays written between 1730 and 1920.
     11 All noun phrases are annotated; singletons were removed.
     12 In addition, generic entities, abstract anaphora and ambiguous mentions are marked explicitly.
     13 For the latter two, only part of the corpus has been annotated.
     14 
     15 ### File Naming
     16 
     17 Each file name consists of a short form of the play's title and a file extension indicating the format, e.g. `Rosenkavalier.xmi`, `Rosenkavalier.xml`, `Rosenkavalier.conll` for "Der Rosenkavalier" by Hugo von Hofmannsthal.
     18 A full list of file names and the corresponding plays is given in `plays.csv`.
     19 
     20 ### Partial Annotations
     21 
     22 Some texts have been annotated only in part, covering one or more acts.
     23 The act(s) annotated are indicated in the filename, e.g. `Manuscript_Act5.xmi`.
     24 If the full text was annotated, no special marker is applied, e.g. `Sara.xmi`.
     25 
     26 ### Parallel Annotations
     27 
     28 To enable inter-annotator agreement studies, single acts were annotated in parallel by different annotators.
     29 These annotations are located in separate branches, and the annotator and act are additionally indicated in the filename, e.g. `Sara_AS_Act1`. `gold` annotations are not specially marked in the filename. (ToDo)
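
The naming conventions above can be read back programmatically. The following is a minimal sketch that assumes file names follow exactly the patterns shown in the examples (`Sara.xmi`, `Manuscript_Act5.xmi`, `Sara_AS_Act1`), that annotator codes consist of uppercase letters, and that the short title contains no underscore. The names `CorpusFile` and `parse_name` are hypothetical and not part of the corpus; how several acts are encoded in one name is not specified here, so only a single `_Act<N>` marker is handled.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern, inferred only from the README examples:
#   <Play>[_<ANNOTATOR>][_Act<N>][.<extension>]
_NAME = re.compile(
    r"^(?P<play>[^_.]+)"             # short title, assumed to contain no "_" or "."
    r"(?:_(?P<annotator>[A-Z]+))?"   # optional annotator code, e.g. "AS"
    r"(?:_Act(?P<act>\d+))?"         # optional act marker, e.g. "_Act5"
    r"(?:\.(?P<ext>.+))?$"           # optional extension, e.g. "xmi" or "xmi.gz"
)


class CorpusFile(NamedTuple):
    play: str
    annotator: Optional[str]
    act: Optional[int]
    ext: Optional[str]


def parse_name(filename: str) -> CorpusFile:
    """Split a corpus file name into play, annotator, act and extension."""
    match = _NAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    act = match.group("act")
    return CorpusFile(
        play=match.group("play"),
        annotator=match.group("annotator"),
        act=int(act) if act is not None else None,
        ext=match.group("ext"),
    )
```

A file with no annotator and no act marker, such as `Sara.xmi`, would be treated as a fully annotated `gold` file under these assumptions.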
     30 
     31 ### Encoding
     32 
     33 All files are encoded in UTF-8 Unicode.
     34 
     35 ## Formats
     36 
     37 We provide several formats to represent the coreference annotations:
     38 
     39 - XMI
     40 - TEI
     41 - CoNLL 2012
     42 - DIRNDL
     43 
     44 For texts that have not been fully annotated, we additionally provide TEI output restricted to the annotated parts.
     45 The CoNLL output contains only the annotated parts.
     46 The XMI output always contains the full text.
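
As an illustration of how the CoNLL output might be consumed, here is a minimal sketch that collects mention spans per coreference chain. It assumes the files follow the standard CoNLL 2012 convention: whitespace-separated columns, `#begin document` / `#end document` comment lines, and the coreference information in the last column (`-` for none, `(3` to open a mention of chain 3, `3)` to close it, `|` between several entries). The function name `read_mentions` is hypothetical, and the column layout of the files in this repository should be checked before relying on it.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def read_mentions(lines: Iterable[str]) -> Dict[str, List[Tuple[int, int]]]:
    """Map each chain id to its mentions as (start, end) token indices.

    Indices are inclusive and count tokens across the whole input,
    without resetting at sentence boundaries.
    """
    chains: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    open_starts: Dict[str, List[int]] = defaultdict(list)  # stack per chain
    token_index = 0
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # sentence break or "#begin/#end document" line
        coref = line.split()[-1]  # assumed: coreference is the last column
        if coref != "-":
            for part in coref.split("|"):
                chain_id = part.strip("()")
                if part.startswith("("):
                    open_starts[chain_id].append(token_index)
                if part.endswith(")"):
                    start = open_starts[chain_id].pop()
                    chains[chain_id].append((start, token_index))
        token_index += 1
    return dict(chains)
```

Because singletons were removed from the corpus, every chain returned for a real file should contain at least two mentions.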
     47 
     48 ### XMI
     49 
     50 As the XMI files can become quite large, they have been compressed using `gzip`.
     51 To decompress them, open a command line and run:
     52 
     53 ```sh
     54 $ gzip -d <FILENAME>.xmi.gz
     55 ```
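
Where the `gzip` command-line tool is not available, the same step can be done with Python's standard library. This is only a sketch; the helper name `gunzip` is made up for illustration. Unlike `gzip -d`, it leaves the compressed file in place.

```python
import gzip
import shutil
from pathlib import Path


def gunzip(path: Path) -> Path:
    """Decompress `<name>.xmi.gz` next to itself and return the new path."""
    target = path.with_suffix("")  # drops only the trailing ".gz"
    with gzip.open(path, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)  # stream, so large files fit in memory
    return target
```

For example, `gunzip(Path("xmi/Sara.xmi.gz"))` would write `xmi/Sara.xmi`, assuming such a file exists in the checkout.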
     56 
     57 ### DIRNDL
     58 
     59 DIRNDL is a file format based on the CoNLL format that additionally contains a speaker column (among others).
     60 
     61 ## Organization
     62 
     63 The annotations are sorted into folders according to the different output formats.
     64 Parallel annotations by different annotators are organized into branches in the git tree. (ToDo)
     65 The main annotations are located in the `gold` branch.
     66 Partial annotations are placed in a subfolder called `part` inside the respective format folder.
     67 
     68 ### Folder structure
     69 
     70 ```sh
     71 $ tree -d
     72 .
     73 ├── conll
     74 │   └── part
     75 ├── tei
     76 │   └── part
     77 └── xmi
     78 ```
     79 
     80 ### Branches
     81 
     82 ```sh
     83 $ git branch
     84 * gold
     85 ```
     86 
     87 ## Citing
     88 
     89 If you are using GerDraCor-Coref for a publication, please refer to the following paper:
     90 
     91 - Janis Pagel, Nils Reiter. GerDraCor-Coref: A Coreference Corpus for Dramatic Texts in German. In *Proceedings of the Language Resources and Evaluation Conference (LREC)*, pp. 55-64, Marseille, France, May 2020. URL: http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.7.pdf
     92  
     93 ```
     94 @inproceedings{gerdracorcoref,
     95    author    = {Janis Pagel and Nils Reiter},
     96    booktitle = {{Proceedings of the Language Resources and Evaluation Conference (LREC)}},
     97    location  = {Marseille, France},
     98    month     = {5},
     99    pages     = {55--64},
    100    title     = {{GerDraCor-Coref: A Coreference Corpus for Dramatic Texts in German}},
    101    url       = {http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.7.pdf},
    102    year      = {2020},
    103 }
    104 ```
    105 
    106 ## License
    107 
    108 Like [GerDraCor](https://github.com/dracor-org/gerdracor), GerDraCor-Coref is released under the [Creative Commons Zero copyright waiver CC0](https://creativecommons.org/share-your-work/public-domain/cc0/).
    109 
    110 ## Contribution
    111 
    112 We appreciate contributions regarding extensions, bug fixes and the like.
    113 Please feel free to create issues or pull requests.