This page describes work done at the School of Computer Science, Glasgow on the SICSA multicore challenge.
Concordance
The first application proposed was a file concordance application.
Specification of the Concordance application.
Given: a text file containing English text in ASCII encoding, and an integer N.

Find: for all sequences of words, up to length N, occurring in the input file, the number of occurrences of this sequence in the text, together with a list of start indices. Optionally, sequences with only 1 occurrence should be omitted.
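As an illustration (this example is ours, not part of the original specification): with N = 2 and the input text "the cat and the dog", the single word "the" occurs twice, at word positions 0 and 3, and would be reported with both start indices, while every two-word sequence such as "the cat" occurs only once and could therefore be omitted.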
First Serial Experiments
Prior to doing any parallelisation it is advisable to set up a good sequential version first. I was initially doubtful that this challenge would provide an effective basis for parallelisation because it seemed such a simple problem. Intuitively it seems like a problem that is likely to be of either linear or at worst log-linear complexity, and for such problems, especially ones involving text files, the time taken to read in the file and print out the results can easily come to dominate the total time taken. If a problem is disk bound, then there is little advantage in expending effort to run it on multiple cores. However, that was only a hypothesis and needed to be verified by experiment.
In line with our school motto of programming to an interface, not an implementation, the interface above rather than the Haskell implementation was chosen as the starting point. In order to get a bog-standard implementation, C was chosen as the implementation language. The source code is linked to below.
The initial hypothesis was confirmed. The serial C implementation runs sufficiently fast that there is no plausible reason to seek to accelerate it by parallelisation. In particular, it:

- is significantly faster than the Haskell version. This is not surprising, as one would expect C to be more efficient.
- appears to have a lower complexity order than the Haskell version. This would indicate that the Haskell version is not a good starting point for others attempting to produce a fast parallel version.
- has a run time dominated by the time to output the results to file. This is shown by comparing the run times with printing enabled to those with printing disabled.

The test files provided in the initial proposal were too short to get a realistic estimate of performance, although they are useful for programme debugging. We thus did our main tests using the World English Bible, a link to whose source is given below.

Linux implementations run substantially faster than Windows ones on the same processor, with gcc used in both cases. Probably file I/O is worse under Windows.
Parallelising
The concordance problem is hard to parallelise efficiently. You cannot just split a book into two halves, prepare a concordance for each half and then merge, since we only have to print out repeated words and phrases. A repeated word might be missed in this case if it was mentioned once in the first half and once in the second half. A more complicated approach is needed. We have chosen to parallelise by getting all threads to read the entire book, since reading turns out to be relatively fast. The words themselves are then divided into disjoint sets; one obvious split would be into 26 sets on the first letter. We then allocate each thread to do the concordance for a disjoint subset of the words. A large part of the time is also taken up with output: the printed concordance can be 5 times as large as the input file. If distinct cores are producing this, there is an inevitable serial phase in which the outputs of the different cores are merged into a single serial file.
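As a rough illustration of the word-partitioning idea, the test each thread applies could look like the following C sketch; the hash function and the modulo split are our assumptions for illustration, not the code of the actual implementation:

#include <stdint.h>

/* Illustrative multiplicative string hash (an assumption, not
   necessarily the hash used in the real programme). */
static uint32_t word_hash(const char *w)
{
    uint32_t h = 0;
    while (*w)
        h = h * 31u + (uint32_t)(unsigned char)*w++;
    return h;
}

/* Thread t out of nthreads handles only the words in its residue
   class, so the word sets of the threads are disjoint. */
static int belongs_to_thread(const char *word, int t, int nthreads)
{
    return (int)(word_hash(word) % (uint32_t)nthreads) == t;
}

Splitting on the first letter into 26 sets is simpler, but a hash split tends to balance the load better, since letter frequencies in English text are very uneven.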
First Parallel Experiment
As a first parallel experiment, a dual-core version of the C programme was produced using the pthreads library, and it was tested on the same dual-processor machine as the original serial version of the algorithm. The following conclusions are derived from the results shown above:
- There was no gain from using multithreading on Windows. It looks as if the pthreads library under Windows simply multithreads operations on a single core rather than using both cores.
- On Linux there was a small gain in performance due to multithreading: about 17% faster in elapsed time using 2 cores.
- Since a large part of the program execution is spent writing the results out, this proves a challenge to multicore. The first parallel version adopted the strategy of allowing each thread to write its part of the results to a different file. It is probably a good idea to investigate pre-buffering the formatted output in memory before writing to disk, as there is a danger of enforced serialisation during the system calls if the individual threads each write a small part of the data to disk.
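A minimal sketch of the dual-core pthreads structure, assuming each thread filters words by its id and writes its own output file; the worker body and the commented-out helper are placeholders, not the actual concordance2.c code:

#include <pthread.h>
#include <stdio.h>

/* Hypothetical per-thread argument block: each thread builds the
   concordance for its own disjoint word subset. */
struct job {
    int id;          /* which word subset this thread handles */
    int nthreads;    /* total number of threads */
};

static void *worker(void *arg)
{
    struct job *j = arg;
    char name[32];
    snprintf(name, sizeof name, "WEB%d.con", j->id);
    /* Placeholder for the real work: tokenise the shared buffer,
       keep words with hash % nthreads == id, and print that part
       of the concordance to this thread's own file. */
    /* build_concordance_subset(j->id, j->nthreads, name); */
    (void)name;
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    struct job jobs[2] = { { 0, 2 }, { 1, 2 } };
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, &jobs[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    /* The two output files then have to be concatenated, as noted above. */
    return 0;
}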
Second Parallel Experiment
This follows the same basic strategy as the previous one, but uses the shell, rather than pthreads, to fork off parallel processes, and communicates via files using the following command:
./l1concordance WEB.txt 4 P 1 0 > WEB0.con &
./l1concordance WEB.txt 4 P 1 1 > WEB1.con
wait
cat WEB1.con >> WEB0.con
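Reading these arguments against the SCC invocation described below, the parameters to l1concordance appear to be the input file, N, the print flag, the process-id mask, and the process id: here the mask of 1 gives one significant bit, so the two processes each handle the half of the words selected by their id, 0 or 1.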
This has the best performance of the lot, as seen in the attached tables above.
SCC Experiment
We have also tried the programme out on the Intel SCC, with relatively poor results.
The SCC is configured as a host processor, which is a conventional modern Intel x86 chip. Attached to this is the experimental 48-core SCC chip, each of whose cores runs a discrete copy of Linux. A major worry here was the problem of file I/O for the multiple cores. The source file and the output files were placed on a shared NFS system and accessed from there. Looking at the time for one core doing the full task, one can see that it is much slower than a single core on the host doing the same task. It is unclear how much of this slowdown is due to the slow access to files from the daughter copies of Linux and how much is due to the poorer performance of the individual cores on the SCC. The top 3 lines of the table show the effects of trying to do smaller portions of the workload on an individual core.
Timings of the SCC doing the concordance test (elapsed time in seconds):

1 core doing full concordance           26.17
1 core doing half concordance           13.48
1 core doing 1/8 concordance             5.59
2 cores doing half each                 49
8 cores doing 1/8 each                  36
32 cores doing 1/32 each                34
host processor doing it all              1.03
host processor using 2 cores on host     0.685
We dispatched 32 tasks using the pssh command as follows:

# shell script to run on host to run concordance on 32 SCC cores
rm /shared/stdout/*
pssh -t 800 -h hosts32 -o /shared/stdout /shared/sccConcordance32
cat /shared/stdout/* | sort > WEB.con
The first line removes any temporary output from a previous run. We then use pssh to run the script sccConcordance32 in a shared directory, sending the output to the /shared/stdout directory. When all the tasks have run, their output is concatenated and sorted to yield the final file.
The script sccConcordance32 invokes the actual concordance task:
cd /shared
./l1concord WEB.txt 4 P 31 $(hostname)
The hostname is used to derive a process id, which is then used to select which words will be handled by the task. The 4th parameter to l1concord is the mask that is applied to give the number of significant bits in the process id, 5 in this case.
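A minimal sketch of how the id and mask might be applied, assuming the SCC hostnames end in the core number and that words are selected by the low bits of a hash; the details here are our illustration, not necessarily what l1concord does:

#include <stdlib.h>
#include <string.h>

/* Hypothetical: hostnames are assumed to end in a core number,
   e.g. "rck31"; the trailing digits give the process id. */
static int id_from_hostname(const char *hostname)
{
    const char *p = hostname + strlen(hostname);
    while (p > hostname && p[-1] >= '0' && p[-1] <= '9')
        p--;
    return atoi(p);
}

/* A word belongs to this process if the low bits of its hash,
   selected by the mask (31 = 5 significant bits), equal the id. */
static int mine(unsigned hash, int mask, int id)
{
    return (int)(hash & (unsigned)mask) == id;
}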
It is clear that on a file-I/O-bound task like this, the SCC is a poor platform.
Source files
concordance.c - the serial program I tested. USAGE: concordance textfile N [P], where the optional P indicates whether printing is to be switched on.

concordance2.c - a dual-core version of the above programme; it generates its output in two files which later have to be concatenated.

l1concordance.c - a version of the programme that is designed to be run from the shell as two cores. This is also the version run on the SCC.

WEB.txt - the World English Bible that I used as test data.
Notes on the Algorithm and Data Structures
The data structure used was a hash table linking to prefix trees for each unique word encountered in the file.
Strings were tokenised using a finite state recogniser. Prefix trees were a modified form of trie, with sorting at each level by hashcode.
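A minimal sketch of these data structures in C, with illustrative field names (the real concordance.c may differ):

/* Illustrative node in the prefix tree: one node per word sequence.
   A child extends its parent's sequence by one word; siblings are
   kept sorted by the hashcode of that word, as described above. */
struct trienode {
    int word;               /* token number of the word at this node */
    unsigned hash;          /* hashcode used to order siblings */
    int count;              /* occurrences of the sequence so far */
    int *starts;            /* start index of each occurrence */
    struct trienode *child; /* first one-word extension */
    struct trienode *next;  /* next sibling, in hashcode order */
};

/* Hash table of roots: each bucket chains the root trienodes for
   the unique words whose hash lands in that bucket. */
#define NBUCKETS 4096
struct trienode *buckets[NBUCKETS];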
The algorithm is four-pass:

1. Read the file into a buffer.
2. Produce a tokenised version of the buffer.
3. Build the hash table and prefix trees.
4. If the concordance is to be printed out, perform a traversal of the trees, printing out the word sequences in the format suggested by the Haskell version.
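Put together, the four passes might be structured as in this skeleton; the commented-out helper names are placeholders for the real routines, not the actual code:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    if (argc < 3) {
        fprintf(stderr, "USAGE %s textfile N [P]\n", argv[0]);
        return 1;
    }
    int n = atoi(argv[2]);
    int print = argc > 3 && argv[3][0] == 'P';

    /* Pass 1: read the file named argv[1] into a buffer. */
    /* char *buf = read_file(argv[1]); */

    /* Pass 2: produce a tokenised version of the buffer, using the
       finite state recogniser. */
    /* int ntokens = tokenise(buf, tokens); */

    /* Pass 3: build the hash table and prefix trees for word
       sequences of up to n words. */
    /* build_tables(tokens, ntokens, n); */

    /* Pass 4: if printing is enabled, traverse the trees, printing
       the word sequences in the format suggested by the Haskell
       version. */
    /* if (print) print_concordance(); */

    (void)n; (void)print;
    return 0;
}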