API

Module holding a class that computes the mask for HTML documents.

DegrotesqueHTMLMarker

Bases: DegrotesqueMarker

A class that returns the mask for SGML (HTML/XML) documents.

Masks all element definitions (opening, closing, and single tags) and everything else enclosed in < and >. Also masks the contents of code elements (<pre>, <code>, and others). Masks links.

get_extensions()

Returns the extensions of file types that can be processed using this marker.

Returns:
  • List[str]

    A list of extensions
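A list of extensions like this is typically used to decide whether a file should be handed to the marker. The following is a minimal sketch of such a check, not part of the library: the helper `can_process` and the values in `assumed_extensions` are hypothetical, since the actual return value of get_extensions() is not stated here.

```python
from pathlib import Path
from typing import List


def can_process(filename: str, extensions: List[str]) -> bool:
    """Return True if the file's extension appears in the given list.

    Hypothetical helper; the comparison is case-insensitive and tolerates
    extensions written with or without a leading dot.
    """
    suffix = Path(filename).suffix.lower().lstrip(".")
    known = {ext.lower().lstrip(".") for ext in extensions}
    return suffix in known


# Assumed example values; the marker's real list may differ.
assumed_extensions = ["html", "htm", "xml"]

print(can_process("index.HTML", assumed_extensions))
print(can_process("notes.md", assumed_extensions))
```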

get_mask(document, to_skip=None)

Returns a string in which HTML elements are marked with '1' and plain content with '0'.

Parameters:
  • document (str) –

    The HTML document (contents) to process

  • to_skip (List[str], default: None ) –

    List of elements to skip (HTML/SGML/XML)

Returns:
  • str

    Annotation of the HTML document.
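To illustrate the shape of such a mask, here is a simplified, self-contained sketch. It is not the library's implementation: the function `sketch_mask` is hypothetical, it only marks characters between < and >, and it leaves out the masking of code element contents, links, and the to_skip parameter.

```python
def sketch_mask(document: str) -> str:
    """Mark every character inside <...> with '1' and all others with '0'.

    Simplified illustration only: no handling of code elements, links,
    comments, or a list of elements to skip.
    """
    mask = []
    inside_tag = False
    for char in document:
        if char == "<":
            inside_tag = True
        # The delimiters themselves count as part of the tag.
        mask.append("1" if inside_tag else "0")
        if char == ">":
            inside_tag = False
    return "".join(mask)


document = '<p>"Hi"</p>'
print(document)               # <p>"Hi"</p>
print(sketch_mask(document))  # 11100001111
```

In this sketch the mask has one character per character of the input, so a later processing step could consult the mask position by position and change only the text marked '0'.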

_get_tag_name(document)

Returns the name of the tag that starts at the beginning of the given string.

Parameters:
  • document (str) –

    The part of the HTML document that starts with the tag

Returns:
  • str

    The name of the tag
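The idea of reading a tag name from the start of a fragment can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `sketch_tag_name` is a hypothetical function, it assumes the fragment begins with <, and it assumes that the leading slash of a closing tag is not part of the name.

```python
def sketch_tag_name(fragment: str) -> str:
    """Return the tag name of a fragment that is assumed to start with '<'."""
    i = 1  # skip the leading '<'
    # Assumption: a closing tag such as </pre> yields "pre".
    if fragment[i:i + 1] == "/":
        i += 1
    start = i
    # The name ends at whitespace, '/' (self-closing tag), or '>'.
    while i < len(fragment) and not fragment[i].isspace() and fragment[i] not in "/>":
        i += 1
    return fragment[start:i]


print(sketch_tag_name('<a href="x">link</a>'))  # a
print(sketch_tag_name("</pre>"))                # pre
print(sketch_tag_name("<br/>"))                 # br
```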