Recipes
=======

These are some additional possible uses for chardetng_py.

If there’s sufficient interest, we can stabilise these and include them
in the main package.

Detect the encoding of a bytestring and return a CodecInfo object
-----------------------------------------------------------------

.. code:: python

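   import codecs
   from typing import Union

   from chardetng_py import detect


   def detect_codec(
       byte_str: Union[bytes, bytearray], *, allow_utf8: bool = True
   ) -> codecs.CodecInfo:
       r"""Detect the encoding of byte_str and return a CodecInfo object.

       Parameters
       ----------
       byte_str : bytes or bytearray
           Input buffer to detect the encoding of.
       allow_utf8 : bool, default True
           If False, UTF-8 is never returned as the detected encoding.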

       Examples
       --------
       >>> codec = detect_codec(b"Jakby r\xeaka Boga")
       >>> codec.name
       'cp1254'

       """

       return codecs.lookup(detect(byte_str, allow_utf8=allow_utf8))
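
A CodecInfo object is handy because it bundles one-shot and incremental
decoders for the detected encoding. The following is a small sketch of
what you can do with it, using only the standard library. The codec name
is hard-coded here as a stand-in for the result of detect_codec, so that
the snippet runs without chardetng_py installed.

.. code:: python

   import codecs

   # Stand-in for detect_codec(data): the encoding is fixed for illustration.
   codec = codecs.lookup("windows-1254")
   data = b"Jakby r\xeaka Boga"

   # One-shot decoding returns the text and the number of bytes consumed.
   text, consumed = codec.decode(data)

   # Incremental decoding copes with input that arrives in chunks.
   decoder = codec.incrementaldecoder(errors="strict")
   pieces = [decoder.decode(chunk) for chunk in (data[:7], data[7:])]
   pieces.append(decoder.decode(b"", final=True))
   streamed = "".join(pieces)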

Detect the encoding of a bytestring and return the decoded string
-----------------------------------------------------------------

.. code:: python

   from typing import Literal, Union

   from chardetng_py import detect


   def decode(
       byte_str: Union[bytes, bytearray],
       errors: Literal[
           "strict", "ignore", "replace", "backslashreplace", "surrogateescape"
       ] = "strict",
       *,
       allow_utf8: bool = True,
   ) -> str:
       r"""Detect the encoding of byte_str and return the decoded string.

       Parameters
       ----------
       byte_str : bytes or bytearray
           Input buffer to decode.
       errors : {"strict", "ignore", "replace", "backslashreplace", "surrogateescape"}
           Error handler to use. See
           https://docs.python.org/3/library/codecs.html#error-handlers
       allow_utf8 : bool, default True
           If False, UTF-8 is never returned as the detected encoding.

       Examples
       --------
       >>> decode(b"Jakby r\xeaka Boga")
       'Jakby rêka Boga'

       """
       return byte_str.decode(detect(byte_str, allow_utf8=allow_utf8), errors=errors)
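
The errors argument is passed straight through to bytes.decode, so the
standard error handlers behave as they do anywhere else in Python. As a
rough illustration, the sketch below decodes a buffer with a fixed
encoding (UTF-8, chosen here only so that decoding fails; no detection
is involved) and collects what each handler produces. The try_decode
helper is hypothetical and exists only for this example.

.. code:: python

   raw = b"caf\xe9"  # a lone 0xE9 byte is not valid UTF-8


   def try_decode(data: bytes, errors: str) -> str:
       """Decode as UTF-8, reporting the exception name on failure."""
       try:
           return data.decode("utf-8", errors=errors)
       except UnicodeDecodeError as exc:
           return f"<{type(exc).__name__}>"


   handlers = ("strict", "ignore", "replace", "backslashreplace", "surrogateescape")
   results = {name: try_decode(raw, name) for name in handlers}

   for name, value in results.items():
       print(f"{name:>16}: {value!r}")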

Open a file, incrementally determine its encoding and return a TextIOWrapper
----------------------------------------------------------------------------

This is a neat trick that allows you to open a file and detect its
encoding with a fixed amount of memory. The other bindings I’ve found
don’t support this use-case and you end up having to read the entire
file into memory, which is a problem for huge files.

This also lets you directly pass a text file of unknown encoding to
csv.reader or csv.DictReader, for example.

.. code:: python

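   import io
   from contextlib import contextmanager
   from os import PathLike
   from typing import IO, Literal, Union

   from chardetng_py import EncodingDetector


   # Reads the entire file, one chunk at a time.
   # We could add support for reading only up to some fixed position.
   def _detect_buffer(buffer: IO[bytes], *, allow_utf8: bool = True, **kwargs):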
       cursor_initial_position = buffer.tell()

       encoding_detector = EncodingDetector()

       # Not sure this is the best chunk size?
       while chunk := buffer.read(io.DEFAULT_BUFFER_SIZE):
           encoding_detector.feed(chunk, last=False)

       encoding_detector.feed(b"", last=True)

       buffer.seek(cursor_initial_position)

       return io.TextIOWrapper(
           buffer,
           encoding=encoding_detector.guess(tld=None, allow_utf8=allow_utf8),
           **kwargs,
       )


   # Could be nice to have an async one as well
   # unfortunately async fs tools aren't in std lib
   @contextmanager
   def detect_open(
       file: Union[bytes, str, PathLike], mode: Literal["r", "rt"] = "r", **kwargs
   ):
       """Open a file and detect its encoding."""
       if mode not in {"r", "rt"}:
           # TODO Could support r+ and w+ modes of operation?
           raise NotImplementedError("Only reading supported at the moment")

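       # The whole point is that we're going to detect the encoding,
       # so passing one explicitly makes no sense.
       if "encoding" in kwargs:
           raise ValueError("encoding is detected and must not be passed")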

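       # Text-mode options (errors, newline, ...) belong to the TextIOWrapper;
       # open() rejects them in binary mode.
       with open(file, mode="rb") as f:
           yield _detect_buffer(f, **kwargs)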
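
The comment at the top of the recipe mentions reading only up to some
fixed position. One possible shape for that is sketched below: a helper
that feeds at most max_bytes to a detector and then rewinds the buffer.
The feed_prefix name and the RecordingDetector class are hypothetical and
not part of chardetng_py. The stand-in detector only records what it is
fed, so the snippet runs without the package installed; in practice you
would pass an EncodingDetector instead. Bear in mind that detecting from
a prefix means less evidence, and the cut-off may land in the middle of a
multi-byte character, so the guess may be less reliable.

.. code:: python

   import io
   from typing import IO


   def feed_prefix(buffer: IO[bytes], detector, max_bytes: int) -> int:
       """Feed at most max_bytes from buffer to detector, then rewind.

       Returns the number of bytes actually fed.
       """
       start = buffer.tell()
       remaining = max_bytes
       while remaining > 0:
           chunk = buffer.read(min(io.DEFAULT_BUFFER_SIZE, remaining))
           if not chunk:
               break  # reached end of file before the limit
           detector.feed(chunk, last=False)
           remaining -= len(chunk)

       # Signal end of input, as the recipe above does.
       detector.feed(b"", last=True)
       buffer.seek(start)
       return max_bytes - remaining


   class RecordingDetector:
       """Hypothetical stand-in that records the bytes it receives."""

       def __init__(self) -> None:
           self.fed = bytearray()
           self.finished = False

       def feed(self, data: bytes, *, last: bool) -> None:
           self.fed += data
           self.finished = last


   buf = io.BytesIO(b"x" * 100)
   detector = RecordingDetector()
   fed = feed_prefix(buf, detector, max_bytes=10)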