This section contains the following:
- Converter: The various Jupyter notebooks used for the two-step conversion (XML -> pickle -> TF).
- Sourcedata: The various versions of the XML data (used as input to the first step of the conversion, i.e. XML -> pickle).
- Picklefiles: The various versions of the (zipped) pickle files (the output of step 1), used for creating the Text-Fabric files.
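
To make the first conversion step (XML -> pickle) more concrete, here is a minimal sketch using only the Python standard library. It is not the code from the Converter notebooks. The element name `w`, the sample XML, and the helper names `xml_to_records`, `write_zipped_pickle` and `read_zipped_pickle` are assumptions made for illustration, and gzip stands in for whatever compression the files in Picklefiles actually use.

```python
import gzip
import pickle
import xml.etree.ElementTree as ET

# Hypothetical sample; the real GBI XML layout may differ.
SAMPLE_XML = (
    "<book><sentence>"
    '<w unicode="Ἐν" normalized="ἐν">Ἐν</w>'
    '<w unicode="ἀρχῇ" normalized="ἀρχῇ">ἀρχῇ</w>'
    "</sentence></book>"
)


def xml_to_records(xml_text):
    """Flatten every <w> element (assumed name) into a dict of its attributes."""
    root = ET.fromstring(xml_text)
    records = []
    for position, elem in enumerate(root.iter("w"), start=1):
        record = dict(elem.attrib)          # copy all XML attributes
        record["text"] = elem.text or ""    # surface text of the element
        record["position"] = position       # 1-based order in the document
        records.append(record)
    return records


def write_zipped_pickle(records, path):
    """Pickle the records and compress them with gzip (assumed compression)."""
    with gzip.open(path, "wb") as fh:
        pickle.dump(records, fh)


def read_zipped_pickle(path):
    """Load records back, as the second step (pickle -> TF) would need to."""
    with gzip.open(path, "rb") as fh:
        return pickle.load(fh)
```

Keeping the intermediate pickle file means the second step can be rerun without parsing the XML again.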
This directory also contains a few Jupyter notebooks for handling the source data:
- Comparing the content of two XML files.
- Examining the differences between the features 'word' and 'normalized'.
- Finding duplicate structure headings.
- Identifying the punctuation marks used in the corpus.
- Identifying the use of critical signs in the text.
- Comparing the attributes 'unicode' and 'normalized' in the XML data.
The notebooks listed above are not directly related to the creation of the Text-Fabric dataset; they were added to analyse some aspects of the GBI source data.
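
As an illustration of the kind of analysis these notebooks perform, the sketch below counts the punctuation characters in a piece of text. It is not taken from the notebooks. The function name `count_punctuation` is hypothetical, and treating every character whose Unicode general category starts with `P` as punctuation is an assumption that may not match the criteria used for the GBI data.

```python
import unicodedata
from collections import Counter


def count_punctuation(text):
    """Count characters whose Unicode general category is punctuation (P*)."""
    return Counter(
        char for char in text
        if unicodedata.category(char).startswith("P")
    )
```

Calling `count_punctuation` on the concatenated text of the corpus and inspecting `most_common()` gives a quick inventory of the punctuation marks in use.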