Class Reference

This is the main Python binding for chardetng, a Rust character encoding detector for legacy Web content.

The class here is a wrapper around the Rust struct EncodingDetector. The documentation for the Rust struct is available on docs.rs.

For more information about the overall function of the library, read Henri Sivonen’s excellent write-up.

chardetng_py.detector

class chardetng_py.detector.EncodingDetector

A Web browser-oriented detector for guessing what character encoding a stream of bytes is encoded in.

The bytes are fed to the detector incrementally using the feed method. The current guess of the detector can be queried using the guess method. The guessing parameters are arguments to the guess method rather than arguments to the constructor in order to enable the application to check if the arguments affect the guessing outcome. (The specific use case is to disable UI for re-running the detector with UTF-8 allowed and the top-level domain name ignored if those arguments don’t change the guess.)
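The use case in the parenthetical above could be sketched roughly as follows. This is only an illustration: the helper name `override_changes_guess` is hypothetical, and it assumes that passing `tld=None` is an acceptable stand-in for "top-level domain ignored". The `detector` argument is expected to be an EncodingDetector that has already been fed its input.

```python
def override_changes_guess(detector, tld):
    """Return True if allowing UTF-8 and ignoring the TLD alters the guess.

    `detector` is assumed to expose guess(*, tld, allow_utf8) as documented
    for chardetng_py.detector.EncodingDetector.
    """
    # Default browser-style guess: real TLD, UTF-8 not permitted.
    default_guess = detector.guess(tld=tld, allow_utf8=False)
    # Override guess: UTF-8 permitted, TLD ignored (assumption: tld=None).
    override_guess = detector.guess(tld=None, allow_utf8=True)
    return default_guess != override_guess
```

An application could hide its "re-detect with UTF-8 allowed" control whenever this helper returns False, since re-running the detector would not change the result.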

Methods

feed(buffer, *, last)

Inform the detector of a chunk of input.

guess(*, tld, allow_utf8)

Guess the encoding given the bytes pushed to the detector so far (via feed()), the top-level domain name from which the bytes were loaded, and an indication of whether to consider UTF-8 as a permissible guess.

guess_assess(*, tld, allow_utf8)

Performs the same function as guess() with the same parameters, but additionally returns whether the guessed encoding had a higher score than at least one other candidate.

feed(buffer, *, last)

Inform the detector of a chunk of input.

The byte stream is represented as a sequence of calls to this method such that the concatenation of the arguments to this method forms the byte stream. It does not matter how the application chooses to chunk the stream. It is OK to call this method with a zero-length buffer.

The end of the stream is indicated by calling this method with last set to True. In that case, the end of the stream is considered to occur after the last byte of the buffer (which may be zero-length) passed in the same call. Once this method has been called with last set to True this method must not be called again.

If you want to perform detection on just the prefix of a longer stream, do not pass last=True after the prefix if the stream actually still continues.

Returns True if after processing buffer the stream has contained at least one non-ASCII byte and False if only ASCII has been seen so far.

## Parameters

buffer : bytes or bytearray
The next chunk of the byte stream.

last : bool
Whether this is the last chunk of the byte stream.

## Returns

bool
True if the stream has contained at least one non-ASCII byte and False if only ASCII has been seen so far.

## Raises

pyo3_runtime.PanicException
If this method has previously been called with last set to True.
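As a minimal sketch of the chunking contract described for feed(), the helper below feeds an arbitrary iterable of chunks and sets last=True only on the final call. The name `feed_all` is hypothetical; the only thing assumed about `detector` is the documented feed(buffer, *, last) signature.

```python
def feed_all(detector, chunks):
    """Feed every chunk to the detector, flagging only the final one as last.

    Returns the value of the final feed() call, i.e. whether the stream
    contained at least one non-ASCII byte.
    """
    pending = None
    for chunk in chunks:
        # Hold one chunk back so the final chunk can be sent with last=True.
        if pending is not None:
            detector.feed(pending, last=False)
        pending = chunk
    # An empty stream is closed with a zero-length buffer, which is allowed.
    final = pending if pending is not None else b""
    return detector.feed(final, last=True)

# Usage with the real class (requires chardetng_py to be installed):
#   from chardetng_py.detector import EncodingDetector
#   detector = EncodingDetector()
#   with open("page.html", "rb") as f:
#       feed_all(detector, iter(lambda: f.read(4096), b""))
```

Because the helper always ends with last=True, it should be used only when the whole stream is available. For prefix-only detection, call feed() directly with last=False instead.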

guess(*, tld, allow_utf8)

Guess the encoding given the bytes pushed to the detector so far (via feed()), the top-level domain name from which the bytes were loaded, and an indication of whether to consider UTF-8 as a permissible guess.

## Parameters

tld : bytes or bytearray or None
The rightmost DNS label of the hostname of the host the stream was loaded from, in lower-case ASCII form. That is, if the label is an internationalized top-level domain name, it must be provided in its Punycode form. If the TLD that the stream was loaded from is unavailable, None may be passed instead, which is equivalent to passing b"com".

allow_utf8 : bool
If set to False, the return value of this method won’t be "UTF-8". When performing detection on text/html on non-file: URLs, Web browsers must pass False, unless the user has taken a specific contextual action to request an override. This way, Web developers cannot start depending on UTF-8 detection. Such reliance would make the Web Platform more brittle.

## Returns

str
The guessed encoding.

## Raises

pyo3_runtime.PanicException
If tld contains non-ASCII, period, or upper-case letters. The exception condition is intentionally limited to signs of failing to extract the label correctly, failing to provide it in its Punycode form, and failure to lower-case it. Full DNS label validation is intentionally not performed to avoid panics when the reality doesn’t match the specs.
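One way to prepare a tld argument that satisfies the constraints above is sketched here using only the standard library. The helper name `tld_label` is hypothetical, and the use of Python's built-in "idna" codec (which implements the older IDNA 2003 rules) for the Punycode conversion is an assumption about what is adequate for your hostnames.

```python
def tld_label(hostname):
    """Return the rightmost DNS label as lower-case ASCII bytes, or None.

    Non-ASCII labels are converted to their Punycode ("xn--") form.
    """
    # Drop a trailing root dot, then keep only the rightmost label.
    label = hostname.rstrip(".").rsplit(".", 1)[-1]
    if not label:
        return None
    # The "idna" codec leaves ASCII labels as-is and Punycode-encodes others;
    # lower() covers ASCII labels that arrived in upper case.
    return label.encode("idna").lower()
```

The result can be passed directly as tld, with None falling through to the documented b"com" behaviour. The helper does not special-case IP addresses or other hostnames that have no real top-level domain.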

guess_assess(*, tld, allow_utf8)

Performs the same function as guess() with the same parameters, but additionally returns whether the guessed encoding had a higher score than at least one other candidate. If this method returns False, the guessed encoding is likely to be wrong.

## Parameters

tld : bytes or bytearray or None
The rightmost DNS label of the hostname of the host the stream was loaded from, in lower-case ASCII form. That is, if the label is an internationalized top-level domain name, it must be provided in its Punycode form. If the TLD that the stream was loaded from is unavailable, None may be passed instead, which is equivalent to passing b"com".

allow_utf8 : bool
If set to False, the return value of this method won’t be "UTF-8". When performing detection on text/html on non-file: URLs, Web browsers must pass False, unless the user has taken a specific contextual action to request an override. This way, Web developers cannot start depending on UTF-8 detection. Such reliance would make the Web Platform more brittle.

## Returns

encoding : str
The guessed encoding.

higher_score : bool
Whether the guessed encoding had a higher score than at least one other candidate. If this value is False, the guessed encoding is likely to be wrong.

## Raises

pyo3_runtime.PanicException
If tld contains non-ASCII, period, or upper-case letters. The exception condition is intentionally limited to signs of failing to extract the label correctly, failing to provide it in its Punycode form, and failure to lower-case it. Full DNS label validation is intentionally not performed to avoid panics when the reality doesn’t match the specs.
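A caller might use the second return value to decide whether to trust the guess at all. The sketch below assumes the two documented values come back as a 2-tuple in the order (encoding, higher_score). The helper name `guess_or_fallback` and the "windows-1252" default are hypothetical choices for illustration, not part of the library.

```python
def guess_or_fallback(detector, tld, fallback="windows-1252"):
    """Return the detector's guess, or `fallback` if the guess looks unreliable.

    Assumes guess_assess() returns (encoding, higher_score) as a tuple.
    """
    encoding, higher_score = detector.guess_assess(tld=tld, allow_utf8=False)
    # higher_score False means the guess did not outscore any other candidate,
    # which the documentation describes as "likely to be wrong".
    return encoding if higher_score else fallback
```

Whether a fixed fallback is appropriate depends on the application. A locale-derived default may be a better fit in some settings.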