Trouble decoding malformed bytes to integer

Question

I have a simple Python socket server that receives "command" codes encoded in ASCII. Most of the data decodes properly with data.decode("utf-8"), but some of it fails, and falling back to latin-1 turns it into seemingly random characters.

Here are two examples

byte_string1 = b'\xa3\xb67'  # When client sends 67
byte_string2 = b'\xa3\xb6\xa3\xb6'  # When client sends 66

I can see the number 67 and 6-6 in the input, but have been unable to extract them out. Is there a proper way to handle these?

My current attempt is below. I expect to get strings back from the received bytes:

def get_command(data):
    try:
        command = data.decode("utf-8")
    except UnicodeDecodeError as err1:
        logger.debug(f"utf-8 UnicodeDecodeError: {err1} for data: {data}")
        try:
            command = data.decode("latin-1")
        except UnicodeDecodeError as err2:
            logger.debug(f"latin-1 UnicodeDecodeError: {err2} for data: {data}")
            logger.debug(
                f"Taking a guess that the bytes are integers, for data: {data}"
            )
            command = [b for b in data]
    return command

server_ip = '0.0.0.0'
server_port = 1234

server_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server_socket.bind((server_ip, server_port))
server_socket.listen(5)
while True:
    data = client_socket.recv(1024)
    if not data:
        break

    command = get_command(data)

If you have a command code encoded in ASCII, then you should never have a byte \xa3, which is outside of the ASCII range, and you shouldn't be trying to decode with anything other than ASCII. This protocol could use a better description. — tdelaney, Commented Sep 16, 2023 at 22:51
Since you are trying to decode multi-byte sequences, and I presume trying to decode a larger command header, note that socket recv does not honor any message boundaries you might think exist from the remote send. It is purely byte oriented, and if you want to decode something more than 1 byte long, you need to handle the receive not getting the data all at once. — tdelaney, Commented Sep 16, 2023 at 22:53
In hex, string 1 is 3 bytes A3 B6 37 and string 2 is 4 bytes A3 B6 A3 B6. I don't see how that translates to the client sending 67 and 66. Is there a standard header here? And by 66, do you mean 2 ascii digits? A single decimal integer ... or perhaps hex? — tdelaney, Commented Sep 16, 2023 at 22:58
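The point in the comments about recv not honoring message boundaries can be sketched as follows. This is a minimal illustration, assuming the message length is known in advance; the helper name recv_exact and the fixed length n are hypothetical, since the question does not describe the real framing.

```python
import socket


def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes from sock, looping over partial reads."""
    buf = bytearray()
    while len(buf) < n:
        # recv may return fewer bytes than requested, so ask only for the rest
        chunk = sock.recv(n - len(buf))
        if not chunk:
            # An empty read means the peer closed the connection
            raise ConnectionError("peer closed before the full message arrived")
        buf.extend(chunk)
    return bytes(buf)
```

If the protocol turns out to use a delimiter or a length prefix instead of a fixed size, the same loop applies, but the stop condition changes.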

NoName · Accepted Answer · 2023-09-17 08:57:53Z

Your issue is that you're trying to decode a custom byte encoding using standard decoders like UTF-8 and Latin-1. If the byte strings have a specific structure, you should extract the relevant parts manually.

In your case, it appears that the command bytes are encoded in the last part of the byte string. You can slice the byte string to get the relevant bytes.

Here's a simplified version of get_command():

def get_command(data):
    command_bytes = data[2:]  # Skipping first two bytes
    try:
        command = command_bytes.decode("utf-8")
    except UnicodeDecodeError:
        command = [b for b in command_bytes]
    return command

The above function assumes that the first two bytes are always irrelevant for your command decoding.
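To see what that assumption means in practice, here is a quick check against the two samples from the question. The function is repeated only so the snippet runs on its own.

```python
def get_command(data):
    # Same logic as above: drop the first two bytes, then try UTF-8
    command_bytes = data[2:]
    try:
        return command_bytes.decode("utf-8")
    except UnicodeDecodeError:
        # Fall back to the raw integer values of the remaining bytes
        return list(command_bytes)


result1 = get_command(b'\xa3\xb67')         # '7'
result2 = get_command(b'\xa3\xb6\xa3\xb6')  # [163, 182]
```

The first sample yields only '7', because the 6 appears to be packed into \xb6, and the second falls through to the integer list. So the slice alone is enough only if the command really does start at the third byte.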

Update your main loop to incorporate this:

import socket

server_ip = '0.0.0.0'
server_port = 1234

server_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server_socket.bind((server_ip, server_port))
server_socket.listen(5)

while True:
    # Accept a client before reading; recv needs a connected socket
    client_socket, _ = server_socket.accept()
    data = client_socket.recv(1024)
    if not data:
        # Client closed without sending anything; wait for the next one
        continue

    command = get_command(data)

This works only if the command really does start at the third byte.

If the high bit of a byte is used to indicate a new header, you can scan through the byte string to detect these headers and then process the payload bytes accordingly.

Here's a function to do that:

def get_commands(data):
    commands = []
    i = 0
    while i < len(data):
        if data[i] == 0xa3:  # Header byte
            i += 1  # Move to next byte
            if i < len(data):
                msb = data[i] & 0x80  # Most Significant Bit
                lsb = data[i] & 0x7F  # Least Significant Bits
                i += 1  # Move to next byte

                # Construct the command
                command = bytes([msb, lsb])
                if i < len(data):
                    while data[i] & 0x80 == 0:  # No high bit set
                        command += bytes([data[i]])
                        i += 1
                        if i >= len(data):
                            break
                commands.append(command.decode("utf-8", errors="ignore"))
        else:
            i += 1  # Not a header byte; skip it so the loop always advances
    return commands

This approach assumes that a new header starts when the high bit is set. Modify as needed.
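A more compact variant of the same idea is sketched below. It assumes, purely from the two samples in the question, that 0xa3 starts each command and that clearing the high bit of the bytes that follow gives the ASCII digit. The name decode_masked is hypothetical, and the real protocol may differ.

```python
HEADER = 0xA3  # Assumed header byte, inferred from the two samples


def decode_masked(data: bytes) -> list[str]:
    """Split on the assumed header byte and clear the high bit of each payload byte."""
    commands = []
    for chunk in data.split(bytes([HEADER])):
        if not chunk:
            continue  # Empty piece before the first header or between adjacent headers
        # 0xb6 & 0x7f == 0x36, which is ASCII '6'; plain ASCII bytes are unchanged
        commands.append(bytes(b & 0x7F for b in chunk).decode("ascii"))
    return commands
```

With the question's samples, b'\xa3\xb67' gives ['67'] and b'\xa3\xb6\xa3\xb6' gives ['6', '6'].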

It seems like the header is at the front of the byte string: \xa3 plus the most significant bit of the next byte, so \xb6 becomes \x80 and \x36. I don't know how one knows when a new header is starting. Perhaps it's the next byte with the high bit set. — tdelaney, Commented Sep 16, 2023 at 23:07
