I have a more theoretical question about distributed computing on a cluster: imagine you have a very compute-intensive program and want to spread the load across nodes, i.e. other PCs on the network. Which libraries are the easiest to get started with?
So far I have tried to do this with MPI for Python (mpi4py), but either only one of the nodes actually did any computing, or I got an error message. I have no idea why it didn't work properly; I have fought my way through various tutorials.
Here is the code I tried to run with the load distributed, a simple Monte Carlo algorithm:
Code:
import random
import numpy as np
from mpi4py import MPI
import time

def monte_carlo_pi(n_samples):
    # estimate pi by sampling points in the unit square and counting
    # how many fall inside the unit quarter circle
    count = 0
    for _ in range(n_samples):
        x, y = random.random(), random.random()
        dist = np.sqrt(x**2 + y**2)
        count += dist <= 1
    return 4 * count / n_samples

if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    total_samples = 100_000_000
    samples_per_process = total_samples // size

    comm.Barrier()
    start_time = time.time()
    local_pi_estimate = monte_carlo_pi(samples_per_process)
    end_time = time.time()
    elapsed_time = end_time - start_time
    print(f"Time taken on process {rank}: {elapsed_time} seconds")

    # average the per-rank estimates (each rank draws the same number of samples)
    global_pi_estimate = comm.allreduce(local_pi_estimate, op=MPI.SUM) / size
    print(f"Pi estimate: {global_pi_estimate}")
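As an aside, a variant of the same idea sums the raw hit counts with MPI.SUM instead of averaging per-rank pi estimates (equivalent here, since every rank draws the same number of samples), and gives each rank its own RNG seed. A sketch, with the MPI calls shown only as comments (the function name `mc_hits` and the seed-by-rank scheme are my own choices, not from any library):

```python
import random

def mc_hits(n_samples, seed):
    """Count samples that land inside the unit quarter circle."""
    rng = random.Random(seed)  # per-rank seed so ranks draw different points
    hits = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        hits += x * x + y * y <= 1.0  # compare squared distance, no sqrt needed
    return hits

# In the MPI program (same comm/rank/size setup as above):
#   local_hits = mc_hits(samples_per_process, seed=rank)
#   total_hits = comm.allreduce(local_hits, op=MPI.SUM)
#   pi_estimate = 4 * total_hits / (samples_per_process * size)
```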
This is the error I get: PMIX ERROR: NOT-FOUND in file ../../../../../src/mca/gds/base/gds_base_fns.c at line 181
Full output with verbose launch debugging:

Code:
backend-gpu:~/py4mpi$ mpiexec -hostfile hosts.txt -n 16 --display-map python3 monte_carlo_mpi.py --mca plm_base_verbose 30
[server-backend-gpu:139171] mca: base: components_register: registering framework plm components
[server-backend-gpu:139171] mca: base: components_register: found loaded component rsh
[server-backend-gpu:139171] mca: base: components_register: component rsh register function successful
[server-backend-gpu:139171] mca: base: components_register: found loaded component isolated
[server-backend-gpu:139171] mca: base: components_register: component isolated has no register or open function
[server-backend-gpu:139171] mca: base: components_register: found loaded component slurm
[server-backend-gpu:139171] mca: base: components_register: component slurm register function successful
[server-backend-gpu:139171] mca: base: components_open: opening plm components
[server-backend-gpu:139171] mca: base: components_open: found loaded component rsh
[server-backend-gpu:139171] mca: base: components_open: component rsh open function successful
[server-backend-gpu:139171] mca: base: components_open: found loaded component isolated
[server-backend-gpu:139171] mca: base: components_open: component isolated open function successful
[server-backend-gpu:139171] mca: base: components_open: found loaded component slurm
[server-backend-gpu:139171] mca: base: components_open: component slurm open function successful
[server-backend-gpu:139171] mca:base:select: Auto-selecting plm components
[server-backend-gpu:139171] mca:base:select:( plm) Querying component [rsh]
[server-backend-gpu:139171] mca:base:select:( plm) Query of component [rsh] set priority to 10
[server-backend-gpu:139171] mca:base:select:( plm) Querying component [isolated]
[server-backend-gpu:139171] mca:base:select:( plm) Query of component [isolated] set priority to 0
[server-backend-gpu:139171] mca:base:select:( plm) Querying component [slurm]
[server-backend-gpu:139171] mca:base:select:( plm) Selected component [rsh]
[server-backend-gpu:139171] mca: base: close: component isolated closed
[server-backend-gpu:139171] mca: base: close: unloading component isolated
[server-backend-gpu:139171] mca: base: close: component slurm closed
[server-backend-gpu:139171] mca: base: close: unloading component slurm
[server-backend-gpu:139171] [[56867,0],0] plm:rsh: final template argv:
/usr/bin/ssh <template> orted -mca ess "env" -mca ess_base_jobid "3726835712" -mca ess_base_vpid "<template>" -mca ess_base_num_procs "3" -mca orte_node_regex "server-backend-gpu,[3:192].168.0.52,[3:192].168.0.24@0(3)" -mca orte_hnp_uri "3726835712.0;tcp://192.168.0.53:47097" -mca plm "rsh" --tree-spawn -mca routed "radix" -mca orte_parent_uri "3726835712.0;tcp://192.168.0.53:47097" -mca plm_base_verbose "30" -mca rmaps_base_display_map "1" -mca pmix "^s1,s2,cray,isolated"
[feldbus:227053] mca: base: components_register: registering framework plm components
[feldbus:227053] mca: base: components_register: found loaded component rsh
[feldbus:227053] mca: base: components_register: component rsh register function successful
[feldbus:227053] mca: base: components_open: opening plm components
[feldbus:227053] mca: base: components_open: found loaded component rsh
[feldbus:227053] mca: base: components_open: component rsh open function successful
[feldbus:227053] mca:base:select: Auto-selecting plm components
[feldbus:227053] mca:base:select:( plm) Querying component [rsh]
[feldbus:227053] mca:base:select:( plm) Query of component [rsh] set priority to 10
[feldbus:227053] mca:base:select:( plm) Selected component [rsh]
[server-backend:09192] mca: base: components_register: registering framework plm components
[server-backend:09192] mca: base: components_register: found loaded component rsh
[server-backend:09192] mca: base: components_register: component rsh register function successful
[server-backend:09192] mca: base: components_open: opening plm components
[server-backend:09192] mca: base: components_open: found loaded component rsh
[server-backend:09192] mca: base: components_open: component rsh open function successful
[server-backend:09192] mca:base:select: Auto-selecting plm components
[server-backend:09192] mca:base:select:( plm) Querying component [rsh]
[server-backend:09192] mca:base:select:( plm) Query of component [rsh] set priority to 10
[server-backend:09192] mca:base:select:( plm) Selected component [rsh]
[server-backend-gpu:139171] [[56867,0],0] complete_setup on job [56867,1]
Data for JOB [56867,1] offset 0 Total slots allocated 24
======================== JOB MAP ========================
Data for node: 192.168.0.52 Num slots: 8 Max slots: 0 Num procs: 8
Process OMPI jobid: [56867,1] App: 0 Process rank: 0 Bound: N/A
Process OMPI jobid: [56867,1] App: 0 Process rank: 1 Bound: N/A
Process OMPI jobid: [56867,1] App: 0 Process rank: 2 Bound: N/A
Process OMPI jobid: [56867,1] App: 0 Process rank: 3 Bound: N/A
Process OMPI jobid: [56867,1] App: 0 Process rank: 4 Bound: N/A
Process OMPI jobid: [56867,1] App: 0 Process rank: 5 Bound: N/A
Process OMPI jobid: [56867,1] App: 0 Process rank: 6 Bound: N/A
Process OMPI jobid: [56867,1] App: 0 Process rank: 7 Bound: N/A
Data for node: 192.168.0.24 Num slots: 8 Max slots: 0 Num procs: 8
Process OMPI jobid: [56867,1] App: 0 Process rank: 8 Bound: N/A
Process OMPI jobid: [56867,1] App: 0 Process rank: 9 Bound: N/A
Process OMPI jobid: [56867,1] App: 0 Process rank: 10 Bound: N/A
Process OMPI jobid: [56867,1] App: 0 Process rank: 11 Bound: N/A
Process OMPI jobid: [56867,1] App: 0 Process rank: 12 Bound: N/A
Process OMPI jobid: [56867,1] App: 0 Process rank: 13 Bound: N/A
Process OMPI jobid: [56867,1] App: 0 Process rank: 14 Bound: N/A
Process OMPI jobid: [56867,1] App: 0 Process rank: 15 Bound: N/A
=============================================================
Data for JOB [56867,1] offset 0 Total slots allocated 24
======================== JOB MAP ========================
Data for node: 192.168.0.52 Num slots: 8 Max slots: 0 Num procs: 8
Process OMPI jobid: [56867,1] App: 0 Process rank: 0 Bound: N/A
Process OMPI jobid: [56867,1] App: 0 Process rank: 1 Bound: N/A
Process OMPI jobid: [56867,1] App: 0 Process rank: 2 Bound: N/A
Process OMPI jobid: [56867,1] App: 0 Process rank: 3 Bound: N/A
Process OMPI jobid: [56867,1] App: 0 Process rank: 4 Bound: N/A
Process OMPI jobid: [56867,1] App: 0 Process rank: 5 Bound: N/A
Process OMPI jobid: [56867,1] App: 0 Process rank: 6 Bound: N/A
Process OMPI jobid: [56867,1] App: 0 Process rank: 7 Bound: N/A
Data for node: 192.168.0.24 Num slots: 8 Max slots: 0 Num procs: 8
Process OMPI jobid: [56867,1] App: 0 Process rank: 8 Bound: UNBOUND
Process OMPI jobid: [56867,1] App: 0 Process rank: 9 Bound: UNBOUND
Process OMPI jobid: [56867,1] App: 0 Process rank: 10 Bound: UNBOUND
Process OMPI jobid: [56867,1] App: 0 Process rank: 11 Bound: UNBOUND
Process OMPI jobid: [56867,1] App: 0 Process rank: 12 Bound: UNBOUND
Process OMPI jobid: [56867,1] App: 0 Process rank: 13 Bound: UNBOUND
Process OMPI jobid: [56867,1] App: 0 Process rank: 14 Bound: UNBOUND
Process OMPI jobid: [56867,1] App: 0 Process rank: 15 Bound: UNBOUND
=============================================================
Data for JOB [56867,1] offset 0 Total slots allocated 24
======================== JOB MAP ========================
Data for node: 192.168.0.52 Num slots: 8 Max slots: 0 Num procs: 8
Process OMPI jobid: [56867,1] App: 0 Process rank: 0 Bound: UNBOUND
Process OMPI jobid: [56867,1] App: 0 Process rank: 1 Bound: UNBOUND
Process OMPI jobid: [56867,1] App: 0 Process rank: 2 Bound: UNBOUND
Process OMPI jobid: [56867,1] App: 0 Process rank: 3 Bound: UNBOUND
Process OMPI jobid: [56867,1] App: 0 Process rank: 4 Bound: UNBOUND
Process OMPI jobid: [56867,1] App: 0 Process rank: 5 Bound: UNBOUND
Process OMPI jobid: [56867,1] App: 0 Process rank: 6 Bound: UNBOUND
Process OMPI jobid: [56867,1] App: 0 Process rank: 7 Bound: UNBOUND
Data for node: 192.168.0.24 Num slots: 8 Max slots: 0 Num procs: 8
Process OMPI jobid: [56867,1] App: 0 Process rank: 8 Bound: N/A
Process OMPI jobid: [56867,1] App: 0 Process rank: 9 Bound: N/A
Process OMPI jobid: [56867,1] App: 0 Process rank: 10 Bound: N/A
Process OMPI jobid: [56867,1] App: 0 Process rank: 11 Bound: N/A
Process OMPI jobid: [56867,1] App: 0 Process rank: 12 Bound: N/A
Process OMPI jobid: [56867,1] App: 0 Process rank: 13 Bound: N/A
Process OMPI jobid: [56867,1] App: 0 Process rank: 14 Bound: N/A
Process OMPI jobid: [56867,1] App: 0 Process rank: 15 Bound: N/A
=============================================================
[server-backend-gpu:139171] [[56867,0],0] plm:base:receive update proc state command from [[56867,0],2]
[server-backend-gpu:139171] [[56867,0],0] plm:base:receive got update_proc_state for job [56867,1]
[server-backend-gpu:139171] [[56867,0],0] plm:base:receive update proc state command from [[56867,0],1]
[server-backend-gpu:139171] [[56867,0],0] plm:base:receive got update_proc_state for job [56867,1]
[feldbus:227053] PMIX ERROR: NOT-FOUND in file ../../../../../src/mca/gds/base/gds_base_fns.c at line 181
[feldbus:227053] PMIX ERROR: NOT-FOUND in file ../../../../../../src/mca/common/dstore/dstore_base.c at line 2571
[feldbus:227053] PMIX ERROR: NOT-FOUND in file ../../../src/server/pmix_server.c at line 2462
[server-backend:09192] *** Process received signal ***
[server-backend:09192] Signal: Segmentation fault (11)
[server-backend:09192] Signal code: Address not mapped (1)
[server-backend:09192] Failing at address: (nil)
[server-backend:09192] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f342bbb3520]
[server-backend:09192] [ 1] /lib/x86_64-linux-gnu/libpmix.so.2(pmix_bfrops_base_pack_value+0x4b)[0x7f34292a1fdb]
[server-backend:09192] [ 2] /lib/x86_64-linux-gnu/libpmix.so.2(pmix_bfrops_base_pack_kval+0x8f)[0x7f342929fe3f]
[server-backend:09192] [ 3] /lib/x86_64-linux-gnu/libpmix.so.2(pmix_bfrops_base_pack+0x7f)[0x7f34292a2d6f]
[server-backend:09192] [ 4] /lib/x86_64-linux-gnu/libpmix.so.2(pmix_common_dstor_store+0x2c5)[0x7f342929d515]
[server-backend:09192] [ 5] /lib/x86_64-linux-gnu/libpmix.so.2(+0x9f9dc)[0x7f34292699dc]
[server-backend:09192] [ 6] /lib/x86_64-linux-gnu/libevent_core-2.1.so.7(+0x1dee8)[0x7f342ba2cee8]
[server-backend:09192] [ 7] /lib/x86_64-linux-gnu/libevent_core-2.1.so.7(event_base_loop+0x577)[0x7f342ba2ebf7]
[server-backend:09192] [ 8] /lib/x86_64-linux-gnu/libpmix.so.2(+0x9c406)[0x7f3429266406]
[server-backend:09192] [ 9] /lib/x86_64-linux-gnu/libc.so.6(+0x94b43)[0x7f342bc05b43]
[server-backend:09192] [10] /lib/x86_64-linux-gnu/libc.so.6(+0x126a00)[0x7f342bc97a00]
[server-backend:09192] *** End of error message ***
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.
HNP daemon : [[56867,0],0] on node server-backend-gpu
Remote daemon: [[56867,0],1] on node 192.168.0.52
This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
[feldbus:227053] mca: base: close: component rsh closed
[feldbus:227053] mca: base: close: unloading component rsh
[server-backend-gpu:139171] mca: base: close: component rsh closed
[server-backend-gpu:139171] mca: base: close: unloading component rsh
My hosts.txt looks like this:

Code:
192.168.0.52 slots=8
192.168.0.24 slots=8
Code:
mpiexec -hostfile hosts.txt -n 4 python3 monte_carlo_mpi.py
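To check whether the launched processes actually land on both machines at all, independent of the Monte Carlo code, a tiny rank/hostname printer can help. A sketch; the ImportError fallback is only there so the script also runs without mpi4py installed:

```python
# rank_check.py -- launch e.g. with: mpiexec -hostfile hosts.txt -n 4 python3 rank_check.py
import socket

try:
    from mpi4py import MPI
    _comm = MPI.COMM_WORLD
    RANK, SIZE = _comm.Get_rank(), _comm.Get_size()
except ImportError:  # fallback: behave like a plain single-process run
    RANK, SIZE = 0, 1

def rank_info():
    """Return (rank, world size, hostname) for this process."""
    return RANK, SIZE, socket.gethostname()

if __name__ == "__main__":
    r, s, host = rank_info()
    print(f"rank {r} of {s} on host {host}")
```

If every host in hosts.txt reports in with the expected number of ranks, the launch side itself works.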
Maybe someone can help, or point me to a different approach; I'd appreciate it. Thanks in advance!
Regards