Messing around with Erlang, I've been really impressed with the bit syntax, which makes dealing with binary data actually fun. In Erlang, binaries can be written using the following syntax:
1> A = <<205>>.
<<"\315">>
So A is now bound to the binary value 205, which by default is 1 byte long. With the following syntax, we can extract the last four bits without any masking:
2> <<_:4,B:4>> = A.
<<"\315">>
3> B.
13
We can also extract the first 4 bits without shifting and/or masking:
4> <<C:4,_:4>> = A.
<<"\315">>
5> C.
12
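To make the comparison with shifting and masking concrete, here is a minimal sketch that pulls both halves of a byte out in one pattern, next to the arithmetic it replaces. The module name nibble_sketch and both function names are made up for illustration.

```erlang
-module(nibble_sketch).
-export([split/1, split_masked/1]).

%% Split a one-byte binary into {HighNibble, LowNibble} with a single pattern.
split(<<Hi:4, Lo:4>>) ->
    {Hi, Lo}.

%% The same result using an explicit shift and mask, for comparison.
split_masked(<<Byte>>) ->
    {Byte bsr 4, Byte band 16#0F}.
```

Both nibble_sketch:split(<<205>>) and nibble_sketch:split_masked(<<205>>) return {12,13}, matching the values of C and B from the shell session above.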
So, what's the point? Think about binary protocols and formats (MP3, Unicode, IPv4, ...). Dealing with these in your run-of-the-mill language generally entails the same tedious mechanics: gross amounts of bit-shifting and masking to arrive at the values needed to decode the protocol, which forces a cognitive shift from the protocol's specification to the implementation. Erlang's bit syntax lets me match the protocol almost verbatim. Take, for example, the following function, which turns UTF-8 encoded binary data (probably read from a file) into a list of code points (or blows up correctly if the data is corrupted):
01. utf8points(Bin) ->
02.     utf8points(binary_to_list(Bin),[]).
03.
04. utf8points([], L) -> lists:reverse(L);
05. utf8points([H|T], L) ->
06.     case <<H>> of
07.         <<2#110:3, _:5>> -> decode2([H|T], L);
08.         <<2#1110:4, _:4>> -> decode3([H|T], L);
09.         <<2#11110:5,_:3>> -> decode4([H|T], L);
10.         <<0:1,_:7>> -> utf8points(T, [H|L]);
11.         _ -> exit({decode_error, {bad_bytes, [H]}})
12.     end.
Ok, so here's a synopsis of the above code fragment:
(Lines 1-2) This function takes the binary (Bin), transforms it to a list of byte values and applies utf8points/2 on it.
(Line 4) We use the list L as an accumulator for the code points. If the list of byte values has been exhausted, we simply return the accumulator reversed, since we have been prepending to its head along the way.
(Lines 5-12) The list of bytes has not been exhausted. We look at the first byte of the list to decide whether this byte alone, or in combination with the 1-3 bytes following it, determines the code point. If it's a single byte (line 10), we prepend it to the accumulator and recurse on the remaining bytes. In the other cases we pass control to helper functions (decode2, decode3, decode4), depending on the leading bits of the byte. (2#N is used to write the number N in base 2.)
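The lead-byte test described above can be pulled out on its own. The sketch below only classifies a byte by its leading bits and reports how long the sequence it introduces should be. The module lead_byte_sketch and the function seq_len/1 are hypothetical names and are not part of the decoder itself.

```erlang
-module(lead_byte_sketch).
-export([seq_len/1]).

%% Number of bytes in the UTF-8 sequence introduced by lead byte H,
%% or the atom invalid if H cannot start a sequence.
seq_len(H) when is_integer(H), H >= 0, H =< 255 ->
    case <<H>> of
        <<0:1, _:7>>       -> 1;        % 0xxxxxxx: single byte
        <<2#110:3, _:5>>   -> 2;        % 110xxxxx
        <<2#1110:4, _:4>>  -> 3;        % 1110xxxx
        <<2#11110:5, _:3>> -> 4;        % 11110xxx
        _                  -> invalid   % 10xxxxxx or 11111xxx
    end.
```

For instance, lead_byte_sketch:seq_len(16#E2) returns 3, while a stray continuation byte such as 16#80 yields invalid, which is the case the decoder reports as a decode_error.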
And decode3 uses some similar logic:
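decode3([A,B,C|T], L) ->
    case {<<B>>, <<C>>} of
        %% Both continuation bytes must start with the bits 10.
        {<<2#10:2,_:6>>,<<2#10:2,_:6>>} ->
            %% Keep the low 4 bits of A and the low 6 bits of B and C.
            <<V:16>> = <<A:4,B:6,C:6>>,
            utf8points(T, [V|L]);
        _ ->
            exit({decode_error, {bad_bytes, [B,C]}})
    end;
decode3(_, _) ->
    %% Fewer than three bytes remain.
    exit(error_eof).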
decode2 and decode4 are left as an exercise to the reader ;)
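For anyone who wants to check their answer, here is one possible solution, packaged with the functions above so it compiles on its own. This is a sketch, not the only way to do it: the module name utf8_sketch is made up, decode2 and decode4 are my own guess modelled on decode3, and the zero padding (0:5 and 0:3) is an assumption made so that the result lands on a whole number of bytes. Like the original, it does not reject overlong encodings or surrogate code points.

```erlang
-module(utf8_sketch).
-export([utf8points/1]).

%% Entry point: UTF-8 binary in, list of code points out.
utf8points(Bin) when is_binary(Bin) ->
    utf8points(binary_to_list(Bin), []).

utf8points([], L) -> lists:reverse(L);
utf8points([H|T], L) ->
    case <<H>> of
        <<2#110:3, _:5>>   -> decode2([H|T], L);
        <<2#1110:4, _:4>>  -> decode3([H|T], L);
        <<2#11110:5, _:3>> -> decode4([H|T], L);
        <<0:1, _:7>>       -> utf8points(T, [H|L]);
        _ -> exit({decode_error, {bad_bytes, [H]}})
    end.

%% Two-byte sequence: 5 payload bits from A, 6 from B (11 bits in total).
decode2([A,B|T], L) ->
    case <<B>> of
        <<2#10:2, _:6>> ->
            %% Pad with 5 zero bits so the value fills 16 bits.
            <<V:16>> = <<0:5, A:5, B:6>>,
            utf8points(T, [V|L]);
        _ ->
            exit({decode_error, {bad_bytes, [B]}})
    end;
decode2(_, _) ->
    exit(error_eof).

%% Three-byte sequence: 4 + 6 + 6 = 16 payload bits.
decode3([A,B,C|T], L) ->
    case {<<B>>, <<C>>} of
        {<<2#10:2, _:6>>, <<2#10:2, _:6>>} ->
            <<V:16>> = <<A:4, B:6, C:6>>,
            utf8points(T, [V|L]);
        _ ->
            exit({decode_error, {bad_bytes, [B,C]}})
    end;
decode3(_, _) ->
    exit(error_eof).

%% Four-byte sequence: 3 + 6 + 6 + 6 = 21 payload bits.
decode4([A,B,C,D|T], L) ->
    case {<<B>>, <<C>>, <<D>>} of
        {<<2#10:2, _:6>>, <<2#10:2, _:6>>, <<2#10:2, _:6>>} ->
            %% Pad with 3 zero bits so the value fills 24 bits.
            <<V:24>> = <<0:3, A:3, B:6, C:6, D:6>>,
            utf8points(T, [V|L]);
        _ ->
            exit({decode_error, {bad_bytes, [B,C,D]}})
    end;
decode4(_, _) ->
    exit(error_eof).
```

As a quick check, utf8_sketch:utf8points(<<104, 195,169, 226,130,172>>) returns [104,233,8364], i.e. "h", U+00E9 and U+20AC. The trick in every helper is the same: building a binary with A:5 or B:6 keeps only the low bits of the integer, which silently drops the marker bits at the top of each byte.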