This file is indexed.

/usr/share/doc/mrmpi-doc/Interface_python.html is in mrmpi-doc 1.0~20140404-1.

This file is owned by root:root, with mode 0o644.

The actual contents of the file can be viewed below.

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
<HTML>
<CENTER><A HREF = "http://mapreduce.sandia.gov">MapReduce-MPI WWW Site</A> - <A HREF = "Manual.html">MapReduce-MPI Documentation</A> 
</CENTER>




<HR>

<H3>Python interface to the MapReduce-MPI Library 
</H3>
<P>A Python wrapper for the MR-MPI library is included in the
distribution.  The advantage of using Python is how concise the
language is, enabling rapid development and debugging of MapReduce
programs.  The disadvantage is speed, since Python is slower than a
compiled language.  Using the MR-MPI library from Python incurs two
additional overheads, discussed in the <A HREF = "Technical.html">Technical
Details</A> section.
</P>
<P>Before using MR-MPI from a Python script, you need to do two things.
You need to build MR-MPI as a dynamic shared library, so it can be
loaded by Python.  And you need to tell Python how to find the library
and the Python wrapper file python/mrmpi.py.  Both these steps are
discussed below.  If you wish to run MR-MPI in parallel from Python,
you also need to extend your Python with MPI.  This is also discussed
below.
</P>
<P>The Python wrapper for MR-MPI uses the amazing and magical (to me)
"ctypes" package in Python, which auto-generates the interface code
needed between Python and a set of C interface routines for a library.
Ctypes is part of standard Python for versions 2.5 and later.  You can
check which version of Python you have installed, by simply typing
"python" at a shell prompt.
</P>
<P>The following sub-sections cover the rest of the Python discussion:
</P>
<UL><LI><A HREF = "#py_1">Building MR-MPI as a shared library</A>
<LI><A HREF = "#py_2">Installing the Python wrapper into Python</A>
<LI><A HREF = "#py_3">Extending Python with MPI to run in parallel</A>
<LI><A HREF = "#py_4">Testing the Python/MR-MPI interface</A>
<LI><A HREF = "#py_5">Using the MR-MPI library from Python</A> 
</UL>
<HR>

<HR>

<A NAME = "py_1"></A><B>Building MR-MPI as a shared library</B> 

<P>Instructions on how to build MR-MPI as a shared library are given in
the <A HREF = "Start.html">Start section</A>. A shared library is one that is
dynamically loadable, which is what Python requires.  On Linux this is
a library file that ends in ".so", not ".a".
</P>
<P>From the src directory, type
</P>
<PRE>make -f Makefile.shlib foo 
</PRE>
<P>where foo is the machine target name, such as linux or g++ or serial.
This should create the file libmrmpir_foo.so in the src directory, as
well as a soft link libmrmpi.so, which is what the Python wrapper will
load by default.  Note that if you are building multiple machine
versions of the shared library, the soft link is always set to the
most recently built version.
</P>
<P>If this fails, see the <A HREF = "Start.html">Start section</A> for more details.
</P>
<HR>

<A NAME = "py_2"></A><B>Installing the Python wrapper into Python</B> 

<P>For Python to invoke MR-MPI, there are 2 files it needs to know about:
</P>
<UL><LI>python/mrmpi.py
<LI>src/libmrmpi.so 
</UL>
<P>Mrmpi.py is the Python wrapper on the MR-MPI library interface.
Libmrmpi.so is the shared MR-MPI library that Python loads, as
described above.
</P>
<P>You can insure Python can find these files in one of two ways:
</P>
<UL><LI>set two environment variables
<LI>run the python/install.py script 
</UL>
<P>If you set the paths to these files as environment variables, you only
have to do it once.  For the csh or tcsh shells, add something like
this to your ~/.cshrc file, one line for each of the two files:
</P>
<PRE>setenv PYTHONPATH $<I>PYTHONPATH</I>:/home/sjplimp/mrmpi/python
setenv LD_LIBRARY_PATH $<I>LD_LIBRARY_PATH</I>:/home/sjplimp/mrmpi/src 
</PRE>
<P>If you use the python/install.py script, you need to invoke it every
time you rebuild MR-MPI (as a shared library) or make changes to the
python/mrmpi.py file.
</P>
<P>You can invoke install.py from the python directory as
</P>
<PRE>% python install.py [libdir] [pydir] 
</PRE>
<P>The optional libdir is where to copy the MR-MPI shared library to; the
default is /usr/local/lib.  The optional pydir is where to copy the
mrmpi.py file to; the default is the site-packages directory of the
version of Python that is running the install script.
</P>
<P>Note that libdir must be a location that is in your default
LD_LIBRARY_PATH, like /usr/local/lib or /usr/lib.  And pydir must be a
location that Python looks in by default for imported modules, like
its site-packages dir.  If you want to copy these files to
non-standard locations, such as within your own user space, you will
need to set your PYTHONPATH and LD_LIBRARY_PATH environment variables
accordingly, as above.
</P>
<P>If the install.py script does not allow you to copy files into system
directories, prefix the python command with "sudo".  If you do this,
make sure that the Python that root runs is the same as the Python you
run.  E.g. you may need to do something like
</P>
<PRE>% sudo /usr/local/bin/python install.py [libdir] [pydir] 
</PRE>
<P>You can also invoke install.py from the make command in the src
directory as
</P>
<PRE>% make install-python 
</PRE>
<P>In this mode you cannot append optional arguments.  Again, you may
need to prefix this with "sudo".  In this mode you cannot control
which Python is invoked by root.
</P>
<P>Note that if you want Python to be able to load different versions of
the MR-MPI shared library (see <A HREF = "#py_5">this section</A> below), you will
need to manually copy files like lmpmrmpi_g++.so into the appropriate
system directory.  This is not needed if you set the LD_LIBRARY_PATH
environment variable as described above.
</P>
<HR>

<A NAME = "py_3"></A><H4>11.3 Extending Python with MPI to run in parallel 
</H4>
<P>If you wish to run MR-MPI in parallel from Python, you need to extend
your Python with an interface to MPI.  This also allows you to make
MPI calls directly from Python in your script, if you desire.
</P>
<P>There are several Python packages available that purport to wrap MPI
as a library and allow MPI functions to be called from Python.
</P>
<P>These include
</P>
<UL><LI><A HREF = "http://pympi.sourceforge.net/">pyMPI</A>
<LI><A HREF = "http://code.google.com/p/maroonmpi/">maroonmpi</A>
<LI><A HREF = "http://code.google.com/p/mpi4py/">mpi4py</A>
<LI><A HREF = "http://nbcr.sdsc.edu/forum/viewtopic.php?t=89&sid=c997fefc3933bd66204875b436940f16">myMPI</A>
<LI><A HREF = "http://code.google.com/p/pypar">Pypar</A> 
</UL>
<P>All of these except pyMPI work by wrapping the MPI library and
exposing (some portion of) its interface to your Python script.  This
means Python cannot be used interactively in parallel, since they do
not address the issue of interactive input to multiple instances of
Python running on different processors.  The one exception is pyMPI,
which alters the Python interpreter to address this issue, and (I
believe) creates a new alternate executable (in place of "python"
itself) as a result.
</P>
<P>In principle any of these Python/MPI packages should work to invoke
MR-MPI in parallel and MPI calls themselves from a Python script which
is itself running in parallel.  However, when I downloaded and looked
at a few of them, their documentation was incomplete and I had trouble
with their installation.  It's not clear if some of the packages are
still being actively developed and supported.
</P>
<P>The one I recommend, since I have successfully used it with MR-MPI, is
Pypar.  Pypar requires the ubiquitous <A HREF = "http://numpy.scipy.org">Numpy
package</A> be installed in your Python.  After
launching python, type
</P>
<PRE>import numpy 
</PRE>
<P>to see if it is installed.  If not, here is how to install it (version
1.3.0b1 as of April 2009).  Unpack the numpy tarball and from its
top-level directory, type
</P>
<PRE>python setup.py build
sudo python setup.py install 
</PRE>
<P>The "sudo" is only needed if required to copy Numpy files into your
Python distribution's site-packages directory.
</P>
<P>To install Pypar (version pypar-2.1.4_94 as of Aug 2012), unpack it
and from its "source" directory, type
</P>
<PRE>python setup.py build
sudo python setup.py install 
</PRE>
<P>Again, the "sudo" is only needed if required to copy Pypar files into
your Python distribution's site-packages directory.
</P>
<P>If you have successully installed Pypar, you should be able to run
Python and type
</P>
<PRE>import pypar 
</PRE>
<P>without error.  You should also be able to run python in parallel
on a simple test script
</P>
<PRE>% mpirun -np 4 python test.py 
</PRE>
<P>where test.py contains the lines
</P>
<PRE>import pypar
print "Proc %d out of %d procs" % (pypar.rank(),pypar.size()) 
</PRE>
<P>and see one line of output for each processor you run on.
</P>
<P>IMPORTANT NOTE: To use Pypar and MR-MPI in parallel from Python, you
must insure both are using the same version of MPI.  If you only have
one MPI installed on your system, this is not an issue, but it can be
if you have multiple MPIs.  Your MR-MPI build is explicit about which
MPI it is using, since you specify the details in your lo-level
src/MAKE/Makefile.foo file.  Pypar uses the "mpicc" command to find
information about the MPI it uses to build against.  And it tries to
load "libmpi.so" from the LD_LIBRARY_PATH.  This may or may not find
the MPI library that MR-MPI is using.  If you have problems running
both Pypar and MR-MPI together, this is an issue you may need to
address, e.g. by moving other MPI installations so that Pypar finds
the right one.
</P>
<HR>

<A NAME = "py_4"></A><H4>11.4 Testing the Python-MR-MPI interface 
</H4>
<P>To test if MR-MPI is callable from Python in serial, launch Python
interactively and type:
</P>
<PRE>>>> from mrmpi import mrmpi
>>> mr = mrmpi() 
</PRE>
<P>If you get no errors, you're ready to use MR-MPI from Python.  If the
2nd command fails, the most common error to see is
</P>
<PRE>OSError: Could not load MR-MPI dynamic library 
</PRE>
<P>which means Python was unable to load the MR-MPI shared library.  This
typically occurs if the system can't find the MR-MPI shared library,
or if something about the library is incompatible with your Python.
The error message should give you an indication of what went wrong.
</P>
<P>You can also test the load directly in Python as follows, without
first importing from the mrmpi.py file:
</P>
<PRE>>>> from ctypes import CDLL
>>> CDLL("libmrmpi.so") 
</PRE>
<P>If an error occurs, carefully go thru the steps in <A HREF = "Start.html">Start</A>
and above about building a shared library and about insuring Python
can find the necessary two files it needs.
</P>
<H5><B>Test MR-MPI and Python in parallel:</B> 
</H5>
<P>To run MR-MPI in parallel, assuming you have installed the
<A HREF = "http://datamining.anu.edu.au/~ole/pypar">Pypar</A> package as discussed
above, create a test.py file containing these lines:
</P>
<PRE>import pypar
from mrmpi import mrmpi
mr = mrmpi()
print "Proc %d out of %d procs has" % (pypar.rank(),pypar.size()),mr
pypar.finalize() 
</PRE>
<P>You can then run it in parallel as:
</P>
<PRE>% mpirun -np 4 python test.py 
</PRE>
<P>Note that if you leave out the 3 lines from test.py that specify Pypar
commands you will instantiate and run MR-MPI independently on each of
the P processors specified in the mpirun command.  In this case you
should get 4 sets of output, each showing that a MR-MPI was
initialized on a single processor, instead of one set of output
showing MR-MPI was initialized on 4 processors.  If the 1-processor
outputs occur, it means that Pypar is not working correctly.
</P>
<P>Also note that once you import the PyPar module, Pypar initializes MPI
for you, and you can use MPI calls directly in your Python script, as
described in the Pypar documentation.  The last line of your Python
script should be pypar.finalize(), to insure MPI is shut down
correctly.
</P>
<H5><B>Running Python scripts:</B> 
</H5>
<P>Note that any Python script (not just for MR-MPI) can be invoked in
one of several ways:
</P>
<PRE>% python foo.script
% python -i foo.script
% foo.script 
</PRE>
<P>The last command requires that the first line of the script be
something like this:
</P>
<PRE>#!/usr/local/bin/python 
#!/usr/local/bin/python -i 
</PRE>
<P>where the path points to where you have Python installed, and that you
have made the script file executable:
</P>
<PRE>% chmod +x foo.script 
</PRE>
<P>Without the "-i" flag, Python will exit when the script finishes.
With the "-i" flag, you will be left in the Python interpreter when
the script finishes, so you can type subsequent commands.  As
mentioned above, you can only run Python interactively when running
Python on a single processor, not in parallel.
</P>
<HR>

<HR>

<A NAME = "py_5"></A><B>Using the MR-MPI library from Python</B> 

<P>The Python interface to MR-MPI consists of a Python "mrmpi" module,
the source code for which is in python/mrmpi.py, which creates a
"mrmpi" object, with a set of methods that can be invoked on that
object.  The sample Python code below assumes you have first imported
the "mrmpi" module in your Python script, as follows:
</P>
<PRE>from mrmpi import mrmpi 
</PRE>
<P>These are the methods defined by the mrmpi module.  Some of them take
callback functions as arguments, e.g. <A HREF = "map.html">map()</A> and
<A HREF = "reduce.html">reduce()</A>.  These are Python functions you define
elsewhere in your script.  When you register "keys" and "values" with
the library, they can be simple quantities like strings or ints or
floats.  Or they can be Python data structures like lists or tuples.
</P>
<P>These are the class methods defined by the mrmpi module.  Their
functionality and arguments are described in the <A HREF = "Interface_c++.html">C++ interface
section</A>.
</P>
<PRE>mr = mrmpi()                # create a MR-MPI object using the default libmrmpi.so library
mr = mrmpi(mpi_comm)        # ditto, but with a specified MPI communicator
mr = mrmpi(0.0)             # ditto, and the library will finalize MPI
mr = mrmpi(None,"g++")      # create a MR-MPI object using the libmrmpi_g++.so library
mr = mrmpi(mpi_comm,"g++")  # ditto, but with a specified MPI communicator
mr = mrmpi(0.0,"g++")       # ditto, and the library will finalize MPI 
</PRE>
<PRE>mr2 = mr.copy()             # copy mr to create mr2 
</PRE>
<PRE>mr.destroy()                # destroy an mrmpi object, freeing its memory
                            # this will also occur if Python garbage collects 
</PRE>
<PRE>mr.add(mr2)
mr.aggregate()
mr.aggregate(myhash)        # if specified, myhash is a hash function
			    #   called back from the library as myhash(key)
			    # myhash() should return an integer (a proc ID)
mr.broadcast(root)
mr.clone()
mr.close()
mr.collapse(key)
mr.collate()
mr.collate(myhash)          # if specified, myhash is the same function
			    #   as for aggregate() 
</PRE>
<PRE>mr.compress(mycompress)     # mycompress is a function called back from the
			    #   library as mycompress(key,mvalue,mr,ptr)
			    #   where mvalue is a list of values associated
			    #   with the key, mr is the MapReduce object,
			    #   and you (optionally) provide ptr (see below)
			    # your mycompress function should typically
			    #   make calls like mr->add(key,value)
mr.compress(mycompress,ptr) # if specified, ptr is any Python datum
		            #    and is passed back to your mycompress()
			    # if not specified, ptr = None 
</PRE>
<PRE>mr.convert()
mr.gather(nprocs) 
</PRE>
<PRE>mr.map(nmap,mymap)          # mymap is a function called back from the
			    #   library as mymap(itask,mr,ptr)
			    #   where mr is the MapReduce object,
			    #   and you (optionally) provide ptr (see below)
			    # your mymap function should typically
			    #   make calls like mr->add(key,value)
mr.map(nmap,mymap,ptr)      # if specified, ptr is any Python datum
			    #    and is passed back to your mymap()
			    # if not specified, ptr = None
mr.map(nmap,mymap,ptr,addflag) # if addflag is specfied as a non-zero int,
			       #   new key/value pairs will be added to the
			       #   existing key/value pairs 
</PRE>
<PRE>mr.map_file(files,self,recurse,readfile,mymap)
                             # files is a list of filenames and dirnames
			     # mymap is a function called back from the
			     #   library as mymap(itask,filename,mr,ptr)
			     # as above, ptr and addflag are optional args
mr.map_file_char(nmap,files,recurse,readfile,sepchar,delta,mymap)
                             # files is a list of filenames and dirnames
			     # mymap is a function called back from the
			     #   library as mymap(itask,str,mr,ptr)
			     # as above, ptr and addflag are optional args
mr.map_file_str(nmap,files,recurse,readfile,sepstr,delta,mymap)
                             # files is a list of filenames and dirnames
			     # mymap is a function called back from the
			     #   library as mymap(itask,str,mr,ptr)
			     # as above, ptr and addflag are optional args
mr.map_mr(mr2,mymap)         # pass key/values in mr2 to mymap
                             # mymap is a function called back from the
			     #   library as mymap(itask,key,value,mr,ptr)
			     # as above, ptr and addflag are optional args 
</PRE>
<PRE>mr.open()
mr.open(addflag)
mr.print_screen(proc,nstride,kflag,vflag)
mr.print_file(file,fflag,proc,nstride,kflag,vflag) 
</PRE>
<PRE>mr.reduce(myreduce)         # myreduce is a function called back from the
			    #   library as myreduce(key,mvalue,mr,ptr)
			    #   where mvalue is a list of values associated
			    #   with the key, mr is the MapReduce object,
			    #   and you (optionally) provide ptr (see below)
			    # your myreduce function should typically
			    #   make calls like mr->add(key,value)
mr.reduce(myreduce,ptr)     # if specified, ptr is any Python datum
			    #    and is passed back to your myreduce()
			    # if not specified, ptr = None 
</PRE>
<PRE>mr.scan_kv(myscan)          # myscan is a function called back from the
			    #   library as myscan(key,value,ptr)
			    #   for each key/value pair
			    #   and you (optionally) provide ptr (see below)
mr.scan_kv(myscan,ptr)      # if specified, ptr is any Python datum
			    #    and is passed back to your myreduce()
			    # if not specified, ptr = None 
</PRE>
<PRE>mr.scan_kmv(myscan)         # myscan is a function called back from the
			    #   library as myreduce(key,mvalue,ptr)
			    #   where mvalue is a list of values associated
			    #   with the key,
			    #   and you (optionally) provide ptr (see below)
mr.scan_kmv(myscan,ptr)     # if specified, ptr is any Python datum
			    #    and is passed back to your myreduce()
			    # if not specified, ptr = None 
</PRE>
<PRE>mr.scrunch(nprocs,key)
mr.sort_keys(mycompare)
mr.sort_values(mycompare)
mr.sort_multivalues(mycompare) # compare is a function called back from the
			       #   library as mycompare(a,b) where
			       #   a and b are two keys or two values
			       # your mycompare() should compare them
			       #   and return a -1, 0, or 1 
			       #   if a < b, or a == b, or a > b
mr.sort_keys_flag(flag)
mr.sort_values_flag(flag)
mr.sort_multivalues_flag(flag) 
</PRE>
<PRE>mr.kv_stats(level)
mr.kmv_stats(level) 
</PRE>
<PRE>mr.mapstyle(value)             # set mapstyle to value
mr.all2all(value)              # set all2all to value
mr.verbosity(value)            # set verbosity to value
mr.timer(value)                # set timer to value
mr.memsize(value)              # set memsize to value
mr.minpage(value)              # set minpage to value
mr.maxpage(value)              # set maxpage to value 
</PRE>
<PRE>mr.add(key,value)                 # add single key and value
mr.add_multi_static(keys,values)  # add list of keys and values
				  # all keys are assumed to be same length
				  # all values are assumed to be same length
mr.add_multi_dynamic(keys,values) # add list of keys and values
				  # each key may be different length
				  # each value may be different length 
</PRE>
<HR>

<P>Note that you can create multiple MR-MPI objects in your Python
script, and coordinate the data stored in each and moved between them,
just as can from a C or C++ program.
</P>
<P>The class methods above correspond one-to-one with the C++ methods
described <A HREF = "Interface_c++.html">here</A>, except that for C++ methods with
multiple interfaces (e.g. <A HREF = "map.html">map()</A>), there are multiple Python
methods with slightly different names, similar to the <A HREF = "Interface_c.html">C
interface</A>.
</P>
<P>There is no set function the the <I>keyalign</I> and <I>valuealign</I>
<A HREF = "settings.html">settings</A>.  These are hard-wired to 1 for
the Python interface, since no other values make sense, due to
the pickling/unpickling that is performed in key and value data.
</P>
<P>See the Python scripts in the examples directory for
<A HREF = "Examples.html">examples</A> of how these calls are made from a Python
program.  They are conceptually identical to the C++ and C programs in
the same directory.
</P>
</HTML>