Class Reference
This is the main Python binding for chardetng, a Rust character encoding detector for legacy Web content.
The class here is a wrapper around the Rust struct `EncodingDetector`.
The documentation for the Rust struct is available on docs.rs.
For more information about the overall function of the library, read Henri Sivonen’s excellent write-up.
chardetng_py.detector
- class chardetng_py.detector.EncodingDetector
A Web browser-oriented detector for guessing what character encoding a stream of bytes is encoded in.
The bytes are fed to the detector incrementally using the `feed` method. The current guess of the detector can be queried using the `guess` method. The guessing parameters are arguments to the `guess` method rather than arguments to the constructor in order to enable the application to check if the arguments affect the guessing outcome. (The specific use case is to disable UI for re-running the detector with UTF-8 allowed and the top-level domain name ignored if those arguments don’t change the guess.)
Methods

| Method | Description |
| --- | --- |
| `feed(buffer, *, last)` | Inform the detector of a chunk of input. |
| `guess(*, tld, allow_utf8)` | Guess the encoding given the bytes pushed to the detector so far (via `feed()`), the top-level domain name from which the bytes were loaded, and an indication of whether to consider UTF-8 as a permissible guess. |
| `guess_assess(*, tld, allow_utf8)` | Performs the same function as `guess()` with the same parameters, but additionally returns whether the guessed encoding had a higher score than at least one other candidate. |

- feed(buffer, *, last)
Inform the detector of a chunk of input.
The byte stream is represented as a sequence of calls to this method such that the concatenation of the arguments to this method forms the byte stream. It does not matter how the application chooses to chunk the stream. It is OK to call this method with a zero-length buffer.
The end of the stream is indicated by calling this method with `last` set to `True`. In that case, the end of the stream is considered to occur after the last byte of the `buffer` (which may be zero-length) passed in the same call. Once this method has been called with `last` set to `True`, this method must not be called again.
If you want to perform detection on just the prefix of a longer stream, do not pass `last=True` after the prefix if the stream actually still continues.
Returns `True` if, after processing `buffer`, the stream has contained at least one non-ASCII byte, and `False` if only ASCII has been seen so far.
## Parameters
buffer : `bytes` or `bytearray`
The next chunk of the byte stream.
last : `bool`
Whether this is the last chunk of the byte stream.
## Returns
`bool`
`True` if the stream has contained at least one non-ASCII byte and `False` if only ASCII has been seen so far.
## Raises
pyo3_runtime.PanicException
If this method has previously been called with `last` set to `True`.
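The chunking rules for `feed` can be illustrated with a small helper. This is only a sketch: `feed_all` is a hypothetical name, not part of chardetng_py, and it assumes nothing about the detector beyond the `feed(buffer, *, last)` signature documented above. It holds back one chunk so that only the final call carries `last=True`, and it sends a single zero-length buffer when the stream is empty.

```python
from typing import Iterable


def feed_all(detector, chunks: Iterable[bytes]) -> bool:
    """Feed every chunk and mark only the final call with last=True.

    Hypothetical helper (not part of chardetng_py). Returns the result of
    the final feed() call: True if the stream contained at least one
    non-ASCII byte, False if only ASCII was seen.
    """
    pending = b""          # an empty stream still ends with one zero-length feed
    have_pending = False
    for chunk in chunks:
        if have_pending:
            # More data follows, so this chunk is not the last one.
            detector.feed(pending, last=False)
        pending, have_pending = chunk, True
    # Exactly one call with last=True; feed() must not be called afterwards.
    return detector.feed(pending, last=True)


# Usage with the real binding (assumes chardetng_py is installed and that
# the constructor takes no arguments):
#     from chardetng_py.detector import EncodingDetector
#     detector = EncodingDetector()
#     saw_non_ascii = feed_all(detector, [b"caf", b"\xe9"])
```

For detection on a prefix of a longer stream, such a helper would not be appropriate, because it always ends with `last=True`.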
- guess(*, tld, allow_utf8)
Guess the encoding given the bytes pushed to the detector so far (via `feed()`), the top-level domain name from which the bytes were loaded, and an indication of whether to consider UTF-8 as a permissible guess.
## Parameters
tld : `bytes` or `bytearray` or `None`
The rightmost DNS label of the hostname of the host the stream was loaded from in lower-case ASCII form. That is, if the label is an internationalized top-level domain name, it must be provided in its Punycode form. If the TLD that the stream was loaded from is unavailable, `None` may be passed instead, which is equivalent to passing `b"com"`.
allow_utf8 : `bool`
If set to `False`, the return value of this method won’t be `"UTF-8"`. When performing detection on `text/html` on non-`file:` URLs, Web browsers must pass `False`, unless the user has taken a specific contextual action to request an override. This way, Web developers cannot start depending on UTF-8 detection. Such reliance would make the Web Platform more brittle.
## Returns
`str`
The guessed encoding.
## Raises
pyo3_runtime.PanicException
If `tld` contains non-ASCII, period, or upper-case letters. The exception condition is intentionally limited to signs of failing to extract the label correctly, failing to provide it in its Punycode form, and failure to lower-case it. Full DNS label validation is intentionally not performed to avoid panics when the reality doesn’t match the specs.
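The requirements on `tld` (rightmost label, lower-case, Punycode) can be met with the Python standard library. The following is a minimal sketch, not part of chardetng_py: `tld_label` and `override_changes_guess` are hypothetical helper names, and `detector` is assumed to be an `EncodingDetector` that has already been fed. The second helper mirrors the use case from the class description: checking whether allowing UTF-8 and ignoring the TLD would change the guess at all.

```python
from typing import Optional


def tld_label(hostname: str) -> Optional[bytes]:
    """Return the rightmost DNS label as lower-case ASCII bytes, or None.

    Hypothetical helper, not part of chardetng_py.
    """
    host = hostname.strip().rstrip(".")  # tolerate a trailing root dot
    if not host:
        return None
    label = host.rsplit(".", 1)[-1].lower()
    try:
        # The stdlib "idna" codec converts a non-ASCII label to Punycode.
        return label.encode("idna").lower()
    except UnicodeError:
        return None  # the caller can then pass tld=None


def override_changes_guess(detector, tld: Optional[bytes]) -> bool:
    """Report whether allowing UTF-8 and ignoring the TLD alters the guess."""
    default_guess = detector.guess(tld=tld, allow_utf8=False)
    override_guess = detector.guess(tld=None, allow_utf8=True)
    return default_guess != override_guess
```

An application could use the second helper to decide whether offering a "re-run detection" control is worthwhile: if it returns `False`, the override would not change anything.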
- guess_assess(*, tld, allow_utf8)
Performs the same function as `guess()` with the same parameters, but additionally returns whether the guessed encoding had a higher score than at least one other candidate. If this method returns `False`, the guessed encoding is likely to be wrong.
## Parameters
tld : `bytes` or `bytearray` or `None`
The rightmost DNS label of the hostname of the host the stream was loaded from in lower-case ASCII form. That is, if the label is an internationalized top-level domain name, it must be provided in its Punycode form. If the TLD that the stream was loaded from is unavailable, `None` may be passed instead, which is equivalent to passing `b"com"`.
allow_utf8 : `bool`
If set to `False`, the return value of this method won’t be `"UTF-8"`. When performing detection on `text/html` on non-`file:` URLs, Web browsers must pass `False`, unless the user has taken a specific contextual action to request an override. This way, Web developers cannot start depending on UTF-8 detection. Such reliance would make the Web Platform more brittle.
## Returns
encoding : `str`
The guessed encoding.
higher_score : `bool`
Whether the guessed encoding had a higher score than at least one other candidate. If this value is `False`, the guessed encoding is likely to be wrong.
## Raises
pyo3_runtime.PanicException
If `tld` contains non-ASCII, period, or upper-case letters. The exception condition is intentionally limited to signs of failing to extract the label correctly, failing to provide it in its Punycode form, and failure to lower-case it. Full DNS label validation is intentionally not performed to avoid panics when the reality doesn’t match the specs.
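One way to use the two return values is to decode the data with the guessed encoding while keeping the score flag so the caller can treat a low-confidence guess with suspicion. This is a sketch under assumptions: `decode_guessed` is a hypothetical helper, the two values are assumed to come back as an `(encoding, higher_score)` tuple, and Python's codec registry is not guaranteed to recognise every encoding name the detector can return, so the lookup falls back to a caller-chosen codec.

```python
import codecs
from typing import Optional, Tuple


def decode_guessed(
    detector,
    data: bytes,
    tld: Optional[bytes] = None,
    allow_utf8: bool = True,
    fallback: str = "utf-8",
) -> Tuple[str, str, bool]:
    """Feed all of data, then decode it using the detector's guess.

    Hypothetical helper (not part of chardetng_py). Returns the decoded
    text, the guessed encoding name, and the higher_score flag.
    """
    detector.feed(data, last=True)
    # Assumption: guess_assess returns a 2-tuple (encoding, higher_score).
    encoding, higher_score = detector.guess_assess(tld=tld, allow_utf8=allow_utf8)
    try:
        codec = codecs.lookup(encoding)
    except LookupError:
        # Python may not know this encoding name; use the fallback codec.
        codec = codecs.lookup(fallback)
    text = data.decode(codec.name, errors="replace")
    return text, encoding, higher_score
```

When `higher_score` is `False`, a caller might prefer to keep the raw bytes around or ask the user, since the guess is likely to be wrong.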