API

Module holding the base class for computing the mask holding document parts not to process.

DegrotesqueMarker

The base class for computing the mask holding document parts not to process.

get_extensions() abstractmethod

Returns the extensions of file types that can be processed using this marker.

Returns:
  • List[str]

    A list of extensions

get_mask(document, to_skip=None) abstractmethod

Returns a string of the same length as the document in which every character to exclude from replacements is marked as '1' and every character of plain content that shall be processed is marked as '0'.

Parameters:
  • document (str) –

    The document (contents) to process

  • to_skip (List[str], default: None) –

    List of elements to skip (HTML/SGML/XML)

Returns:
  • str

    Annotation of the document.
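
The interface above can be illustrated with a minimal sketch. It does not import the library: `MarkerSketch` is a hypothetical stand-in that only mirrors the two documented abstract methods, and `BacktickMarker` is an invented subclass that protects spans enclosed in backticks. The one-mask-character-per-document-character layout is assumed from the description of get_mask.

```python
import re
from abc import ABC, abstractmethod
from typing import List, Optional


class MarkerSketch(ABC):
    """Hypothetical stand-in mirroring the documented interface (not the library class)."""

    @abstractmethod
    def get_extensions(self) -> List[str]:
        """Return the file extensions this marker can handle."""

    @abstractmethod
    def get_mask(self, document: str, to_skip: Optional[List[str]] = None) -> str:
        """Return a '0'/'1' string with one character per document character."""


class BacktickMarker(MarkerSketch):
    """Invented example marker: excludes `inline code` spans from processing."""

    def get_extensions(self) -> List[str]:
        return ["txt"]

    def get_mask(self, document: str, to_skip: Optional[List[str]] = None) -> str:
        # to_skip is unused here; it is documented as relevant for HTML/SGML/XML.
        mask = ["0"] * len(document)
        for match in re.finditer(r"`[^`]*`", document):
            for i in range(match.start(), match.end()):
                mask[i] = "1"  # protect everything inside the backticks
        return "".join(mask)


document = 'say "hi" to `x = "1"`'
mask = BacktickMarker().get_mask(document)
print(document)
print(mask)
```

Printing the document and the mask on consecutive lines makes it easy to see which characters a later replacement step would leave untouched.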

apply_masks(document, mask)

Masks all URLs and ISSN / ISBN numbers by setting the corresponding positions of the mask to '1'.

The method is meant to be called after an initial mask has been computed, for example by get_mask.

Parameters:
  • document (str) –

    The document (contents) to process

  • mask (str) –

    A previously computed mask

Returns:
  • str

    Annotation of the document: the given mask with URLs and ISSN / ISBN numbers additionally marked as '1'.
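
As a rough sketch of how such a second masking pass might work, the function below marks URLs in an existing mask. It is not the library's implementation: `apply_url_mask` and `URL_PATTERN` are hypothetical names, the regular expression is deliberately simplified, and ISSN / ISBN handling is left out.

```python
import re

# Simplified, assumed pattern; the library's actual expressions may differ.
URL_PATTERN = re.compile(r"https?://\S+")


def apply_url_mask(document: str, mask: str) -> str:
    """Return a copy of `mask` with every URL position in `document` set to '1'."""
    if len(document) != len(mask):
        raise ValueError("document and mask must have the same length")
    chars = list(mask)
    for match in URL_PATTERN.finditer(document):
        # Overwrite the matched span; positions that are already '1' stay '1'.
        chars[match.start():match.end()] = "1" * (match.end() - match.start())
    return "".join(chars)


document = "see http://example.com now"
initial_mask = "0" * len(document)
mask = apply_url_mask(document, initial_mask)
print(document)
print(mask)
```

Because the function only ever writes '1', parts excluded by the initial mask remain excluded after this pass.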