This page describes work done at the School of Computer Science, Glasgow on the SICSA multicore challenge.
Concordance
The first application proposed was a file concordance application.
Specification of the Concordance application.
Given: a text file containing English text in ASCII encoding, and an integer N.

Find: for all sequences of words, up to length N, occurring in the input file, the number of occurrences of this sequence in the text, together with a list of start indices. Optionally, sequences with only 1 occurrence should be omitted.
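As an illustration (this example is ours, not part of the original specification): with N = 2 and the input text "the cat and the dog", the single word "the" occurs twice, at word positions 0 and 3, and would be reported with both start indices, while every two-word sequence such as "the cat" occurs only once and could therefore be omitted.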
First Serial Experiments
Prior to doing any parallelisation it is advisable to set up a good sequential version first. I was initially doubtful that this challenge would provide an effective basis for parallelisation because it seemed such a simple problem. Intuitively it seems like a problem that is likely to be of either linear or at worst log-linear complexity, and for such problems, especially ones involving text files, the time taken to read in the file and print out the results can easily come to dominate the total time taken. If a problem is disk bound, then there is little advantage in expending effort to run it on multiple cores. However, that was only a hypothesis and needed to be verified by experiment.
In line with our school motto of programming to an interface, not an implementation, the interface above rather than the Haskell implementation was chosen as the starting point. In order to get a bog-standard implementation, C was chosen as the implementation language. The source code is linked to below.
The initial hypothesis was confirmed. The serial C implementation runs sufficiently fast that there is no plausible reason to seek to accelerate it by parallelisation. In particular, it:

- is significantly faster than the Haskell version. This is not surprising, as one would expect C to be more efficient.
- appears to have a lower complexity order than the Haskell version. This would indicate that the Haskell version is not a good starting point for others attempting to produce a fast parallel version.
- has a run time dominated by the time to output the results to file. This is shown by comparing the run times with printing enabled to those with printing disabled.

The test files provided in the initial proposal were too short to get a realistic estimate of performance, although they are useful for programme debugging. We thus did our main tests using the World English Bible, a link to whose source is given below.

Linux implementations run substantially faster than Windows ones on the same processor, with gcc used in both cases. Probably file I/O is worse under Windows.
Parallelising
The concordance problem is hard to parallelise efficiently. You cannot just split a book into two halves, prepare a concordance for each half and then merge, since we only have to print out repeated words and phrases. A repeated word might be missed in this case if it was mentioned once in the first half and once in the second half. A more complicated approach is needed. We have chosen to parallelise by getting all threads to read the entire book, since reading turns out to be relatively fast. The words themselves are then divided into disjoint sets; one obvious split would be into 26 sets on the first letter. We then allocate each thread to do the concordance for a disjoint subset of the words. A large part of the time is also taken up with output: the printed concordance can be 5 times as large as the input file. If distinct cores are producing this, there is an inevitable serial phase in which the outputs of the different cores are merged into a single serial file.
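As a rough illustration of the word-partitioning idea, the test each thread applies could look like the following C sketch; the hash function and the modulo split are our assumptions for illustration, not the code of the actual implementation:

#include <stdint.h>

/* Illustrative multiplicative string hash (an assumption, not
   necessarily the hash used in the real programme). */
static uint32_t word_hash(const char *w)
{
    uint32_t h = 0;
    while (*w)
        h = h * 31u + (uint32_t)(unsigned char)*w++;
    return h;
}

/* Thread t out of nthreads handles only the words in its residue
   class, so the word sets of the threads are disjoint. */
static int belongs_to_thread(const char *word, int t, int nthreads)
{
    return (int)(word_hash(word) % (uint32_t)nthreads) == t;
}

Splitting on the first letter into 26 sets is simpler, but a hash split tends to balance the load better, since letter frequencies in English text are very uneven.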
First Parallel Experiment
As a first parallel experiment, a dual-core version of the C programme was produced using the pthreads library, and it was tested on the same dual-processor machine as the original serial version of the algorithm. The following conclusions are derived from the results shown above:
- There was no gain from using multithreading on Windows. It looks as if the pthreads library under Windows simply multithreads operations on a single core rather than using both cores.
- On Linux there was a small gain in performance due to multithreading: about 17% faster in elapsed time using 2 cores.
- Since a large part of the program execution is spent writing the results out, this proves a challenge to multicore. The first parallel version adopted the strategy of allowing each thread to write its part of the results to a different file. It is probably a good idea to investigate pre-buffering the formatted output in memory before writing to disk, as there is a danger of enforced serialisation during the system calls if the individual threads each write a small part of the data to disk.
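A minimal sketch of the dual-core pthreads structure, assuming each thread filters words by its id and writes its own output file; the worker body and the commented-out helper are placeholders, not the actual concordance2.c code:

#include <pthread.h>
#include <stdio.h>

/* Hypothetical per-thread argument block: each thread builds the
   concordance for its own disjoint word subset. */
struct job {
    int id;          /* which word subset this thread handles */
    int nthreads;    /* total number of threads */
};

static void *worker(void *arg)
{
    struct job *j = arg;
    char name[32];
    snprintf(name, sizeof name, "WEB%d.con", j->id);
    /* Placeholder for the real work: tokenise the shared buffer,
       keep words with hash % nthreads == id, and print that part
       of the concordance to this thread's own file. */
    /* build_concordance_subset(j->id, j->nthreads, name); */
    (void)name;
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    struct job jobs[2] = { { 0, 2 }, { 1, 2 } };
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, &jobs[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    /* The two output files then have to be concatenated, as noted above. */
    return 0;
}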
Second Parallel Experiment
This follows the same basic strategy as the previous one, but uses the shell, rather than pthreads, to fork off parallel processes, and communicates via files using the following command:
./l1concordance WEB.txt 4 P 1 0 > WEB0.con &
./l1concordance WEB.txt 4 P 1 1 > WEB1.con
wait
cat WEB1.con >> WEB0.con
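Reading these arguments against the SCC invocation described below, the parameters to l1concordance appear to be the input file, N, the print flag, the process-id mask, and the process id: here the mask of 1 gives one significant bit, so the two processes each handle the half of the words selected by their id, 0 or 1.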
This has the best performance of the lot, as seen in the attached tables above.
SCC Experiment
We have also tried the programme out on the Intel SCC, with relatively poor results.
The SCC is configured as a host processor, which is a conventional modern Intel x86 chip. Attached to this is the experimental 48-core SCC chip, each of whose cores runs a discrete copy of Linux. A major worry here was the problem of file I/O for the multiple cores. The source file and the output files were placed on a shared NFS system and accessed from there. Looking at the time for one core doing the full task, one can see that it is much slower than a single core on the host doing the same task. It is unclear how much of this slowdown is due to the slow access to files from the daughter copies of Linux and how much is due to the poorer performance of the individual cores on the SCC. The top 3 lines of the table show the effects of trying to do smaller portions of the workload on an individual core.
Timings of the SCC doing the concordance test (elapsed time in seconds):

1 core doing full concordance           26.17
1 core doing half concordance           13.48
1 core doing 1/8 concordance             5.59
2 cores doing half each                 49
8 cores doing 1/8 each                  36
32 cores doing 1/32 each                34
host processor doing it all              1.03
host processor using 2 cores on host     0.685
We dispatched 32 tasks using the pssh command as follows:

# shell script to run on host to run concordance on 32 SCC cores
rm /shared/stdout/*
pssh -t 800 -h hosts32 -o /shared/stdout /shared/sccConcordance32
cat /shared/stdout/* | sort > WEB.con
The first line removes any temporary output from a previous run. We then use pssh to run the script sccConcordance32 in a shared directory, sending the output to the /shared/stdout directory. When all the tasks have run, their output is concatenated and sorted to yield the final file.
The script sccConcordance32 invokes the actual concordance task:
cd /shared
./l1concord WEB.txt 4 P 31 $(hostname)
The hostname is used to derive a process id, which is then used to select which words will be handled by the task. The 4th parameter to l1concord is the mask that is applied to give the number of significant bits in the process id, 5 in this case.
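A minimal sketch of how the id and mask might be applied, assuming the SCC hostnames end in the core number and that words are selected by the low bits of a hash; the details here are our illustration, not necessarily what l1concord does:

#include <stdlib.h>
#include <string.h>

/* Hypothetical: hostnames are assumed to end in a core number,
   e.g. "rck31"; the trailing digits give the process id. */
static int id_from_hostname(const char *hostname)
{
    const char *p = hostname + strlen(hostname);
    while (p > hostname && p[-1] >= '0' && p[-1] <= '9')
        p--;
    return atoi(p);
}

/* A word belongs to this process if the low bits of its hash,
   selected by the mask (31 = 5 significant bits), equal the id. */
static int mine(unsigned hash, int mask, int id)
{
    return (int)(hash & (unsigned)mask) == id;
}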
It is clear that on a file-I/O-bound task like this, the SCC is a poor platform.
Source files
concordance.c - the serial program I tested. USAGE: concordance textfile N [P], where the optional P indicates whether printing is to be switched on.

concordance2.c - a dual-core version of the above programme; it generates its output in two files which later have to be concatenated.

l1concordance.c - a version of the programme that is designed to be run from the shell as two cores. This is also the version run on the SCC.

WEB.txt - the World English Bible that I used as test data.
Notes on the Algorithm and Data Structures
The data structure used was a hash table linking to prefix trees for each unique word encountered in the file.
Strings were tokenised using a finite state recogniser. Prefix trees were a modified form of trie, with sorting at each level by hashcode.
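A minimal sketch of these data structures in C, with illustrative field names (the real concordance.c may differ):

/* Illustrative node in the prefix tree: one node per word sequence.
   A child extends its parent's sequence by one word; siblings are
   kept sorted by the hashcode of that word, as described above. */
struct trienode {
    int word;               /* token number of the word at this node */
    unsigned hash;          /* hashcode used to order siblings */
    int count;              /* occurrences of the sequence so far */
    int *starts;            /* start index of each occurrence */
    struct trienode *child; /* first one-word extension */
    struct trienode *next;  /* next sibling, in hashcode order */
};

/* Hash table of roots: each bucket chains the root trienodes for
   the unique words whose hash lands in that bucket. */
#define NBUCKETS 4096
struct trienode *buckets[NBUCKETS];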
The algorithm is four-pass:

1. Read the file into a buffer.
2. Produce a tokenised version of the buffer.
3. Build the hash table and prefix trees.
4. If the concordance is to be printed out, perform a traversal of the trees, printing out the word sequences in the format suggested by the Haskell version.
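Put together, the four passes might be structured as in this skeleton; the commented-out helper names are placeholders for the real routines, not the actual code:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    if (argc < 3) {
        fprintf(stderr, "USAGE %s textfile N [P]\n", argv[0]);
        return 1;
    }
    int n = atoi(argv[2]);
    int print = argc > 3 && argv[3][0] == 'P';

    /* Pass 1: read the file named argv[1] into a buffer. */
    /* char *buf = read_file(argv[1]); */

    /* Pass 2: produce a tokenised version of the buffer, using the
       finite state recogniser. */
    /* int ntokens = tokenise(buf, tokens); */

    /* Pass 3: build the hash table and prefix trees for word
       sequences of up to n words. */
    /* build_tables(tokens, ntokens, n); */

    /* Pass 4: if printing is enabled, traverse the trees, printing
       the word sequences in the format suggested by the Haskell
       version. */
    /* if (print) print_concordance(); */

    (void)n; (void)print;
    return 0;
}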