Digital Media Web Blogs > Web

Using C++ Intrinsic Functions for Pipelined Text Processing


A good C++ programming technique that has almost no published material available on the WWW relates to using the special pipeline instructions in modern CPUs for faster text processing.

Typically compilers (such as for C++ and Java) find it hard to detect possible parallelism in programs: indeed, this is because typically you have to structure the algorithm to suit the CPU's capabilities, which goes against the desired requirements for cross-platform, Write Once Read Anywhere code. It used to require assembler programming, but modern compilers support a special class of functions called intrinsics which are more-or-less macros that expand to assembler code sections.

Lets take the UTF-8 to UTF-16 conversion function used in libxml as our example. Its pretty typical code; it doesn't look too bad to me. UTF-8 is a variable-width encoding of Unicode: it uses one byte for ASCII characters (and with exactly the same codes), two bytes for most Western characters (both bytes have their top bit set), three bytes for most Chinese characters, and four bytes for new rarer characters. UTF-16 is a fixed width encoding of two bytes. (We ignore the issue of surrogates here.)

Here's the original code (with some dead code removed):

/**
* UTF8ToUTF16LE:
* @outb:  a pointer to an array of bytes to store the result
* @outlen:  the length of @outb
* @in:  a pointer to an array of UTF-8 chars
* @inlen:  the length of @in
*
* Take a block of UTF-8 chars in and try to convert it to an UTF-16LE
* block of chars out.
*
* Returns the number of bytes written, or -1 if lack of space, or -2
*     if the transcoding failed. 
*/ 

static int
UTF8ToUTF16LE(unsigned char* outb, int *outlen,
const unsigned char* in, int *inlen)
{
unsigned short* out = (unsigned short*) outb;
const unsigned char* processed = in;
const unsigned char *const instart = in;
unsigned short* outstart= out;
unsigned short* outend;
const unsigned char* inend= in+*inlen;
unsigned int c, d;
int trailing;

/* UTF16LE encoding has no BOM */
if ((out == NULL) || (outlen == NULL) || (inlen == NULL)) return(-1);
if (in == NULL) {
*outlen = 0;
*inlen = 0;
return(0);
}
outend = out + (*outlen / 2);

// PARALLEL OPTIMIZATION TO GO HERE !!

// Output the rest the usual way
while (in < inend) {
d= *in++;
if (d < 0x80) { c= d; trailing= 0; }
else if (d < 0xC0) {
/* trailing byte in leading position */
*outlen = (out - outstart) * 2;
*inlen = processed - instart;
return(-2);
} else if (d < 0xE0) { c= d & 0x1F; trailing= 1; }
else if (d < 0xF0) { c= d & 0x0F; trailing= 2; }
else if (d < 0xF8) { c= d & 0x07; trailing= 3; }
else {
/* no chance for this in UTF-16 */
*outlen = (out - outstart) * 2;
*inlen = processed - instart;
return(-2);
}

if (inend - in < trailing) {
break;
}

for ( ; trailing; trailing--) {
if ((in >= inend) || (((d= *in++) & 0xC0) != 0x80))
break;
c <<= 6;
c |= d & 0x3F;
}

/* assertion: c is a single UTF-4 value */
if (c < 0x10000) {
if (out >= outend)
break;

*out++ = c;
}
else if (c < 0x110000) {
if (out+1 >= outend)
break;
c -= 0x10000;
*out++ = 0xD800 | (c >> 10);
}
else
break;
processed = in;
}
*outlen = (out - outstart) * 2;
*inlen = processed - instart;
return(*outlen);
}

The optimization is very low-impact and modest in scope, but it is enough to show how to use intrinsics for this purpose. At the point marked PARALLEL OPTIMIZATION TO GO HERE !! we add code that tests 16 bytes chunks at a time and tests whether they are all ASCII characters. If so, then we use an efficient unrolled series of assignments to transfer the input bytes to the output. The first time any non-ASCII byte is found in a 16 byte chunk, that chunk and all the rest of the file are processed in the original code's way. In the worst case, when the input has no ASCII characters, there will only be a few cycles penalty; in the best case, when in the input has only ASCII characters, all characters (except any straggling characters where the length is not a multiple of 16) will be processed by the optimized code section; inputs which start off ASCII then get some non-ASCII later (such an English HTML page that may have a few special publishing characters) benefit.

The same optimization could be used for encoders for 8-bit fixed and variable encoders, such as a CP1252 to UTF-16 transcoder: any ASCII compatible encoding where code points less than 0x80 are only used for ASCII characters.

First, we need to add an extra header, which declares the intrinsics functions available to access the SSE2 operations on modern x86 CPUs subsequent to the Pentium 4:

#include <emmintrin.h>

And here is the code. __m128i is a datatype built into recent x86 compilers to represent a 16 byte chunk of memory used to contain integers of various sizes: in our case here, we use intrinsic functions that treat the datatype as 16 signed chars.

	while (in +16  <= inend &&
		!( _mm_movemask_epi8(
			_mm_cmplt_epi8(
				_mm_load_si128 ((__m128i*)in) ,
				_mm_setzero_si128() )))) { 

// unrolled output
*out++ = *in++; // 1
*out++ = *in++; // 2
*out++ = *in++; // 3
*out++ = *in++; // 4
*out++ = *in++; // 5
*out++ = *in++; // 6
*out++ = *in++; // 7
*out++ = *in++; // 8
*out++ = *in++; // 9
*out++ = *in++; // 10
*out++ = *in++; // 11
*out++ = *in++; // 12
*out++ = *in++; // 13
*out++ = *in++; // 14
*out++ = *in++; // 15
*out++ = *in++; // 16

processed += 16; // correct the count
}


The first and last lines are housekeeping. The second line contains
the interesting bits: the intrinsic functions.
The rest of the body are the simple unrolled data transfers.
The code does assume that the input and output arrays are 16-byte aligned,
as required by the intrinsic functions.

So what is going on? The _mm_load_si128() intrinsic function takes a pointer to the input (cast appropriately) and starts a parallelized load of that chunk. The next stage of the pipeline checks for bytes that are less than zero using _mm_cmplt_epi8(): this will find any bytes not in the ASCII range, because it treats the bytes as signed chars. This results in an notional __m128i containing 0x00 in each byte position if all the characters are ASCII. The _mm_movemask_epi8()

intrinsic function then constructs an integer from the top bits of each
of the 16 bytes, and we use that integer for our test. Phew!

How much increase does this give us? Testing the best case on my new notebook with a recent compiler, all ASCII input, inputs larger than 1K have a four times increase in processing rate (at best, from 12.7 cycles per byte to 2.7 cycles per byte.) Even for small 100 byte input, the processing rate is doubled. So even this very basic optimization is worthwhile; and I am sure various improvements would suggest themselves to any reader who looked further.

The SSE2 streaming operations on x86 CPUs are not well suited for the needs of text processing, and could be improved. They are much more aimed at the needs of floating point and integer mathematics; however, I hope this example shows that they can be useful for some text operations too. One of the justifications for XML Binary Infosets is that text operations required by XML parsing are too CPU intensive; this example may perhaps counter that to some extent: optimizations based on the use of intrinsic functions will become more common as x86 systems which don't support SSE2 become obsolete, and some of these optimizations may be relevant for XML parser writers, certainly for transcoder writers.

How geeky is that? This is not code I am maintaining; please put any fixes or suggestions below!

Categories





AddThis Social Bookmark Button



Comments (2)
Read More Entries by Rick Jelliffe.

2 Comments

rjelliffe said:

cool
There is no way in 100% Pure Java.

So you would have to make this into a DLL, then call it using JNI. See http://java.sun.com/docs/books/tutorial/native1.1/
in particular
http://java.sun.com/docs/books/tutorial/native1.1/implementing/array.html
(I don't know how much setup penalty a JNI call has: that is something you would need to double check to make sure the optimization was worthwhile if you have small documents.)

The Intrinsics are processor dependent (and their performance will vary between members of the same processor family) so there must be different code if you are using a powerPC (Apple has a nice page about this,sorry no link).

Rick

anjanb said:

cool
hi rick,

thanks for the info.

how do I take advantage of this in java ? OR can I tell the java interpreter how to make use of it ?

thank you,



BR,

~A

Topics of Interest

Related Books

Recommended for You

Archives


 
 


Or, visit our complete archive.  

Stay Connected