BProc In Depth

Introduction

Bproc is a Single System Image scheme for Linux clusters. Its primary focus is unifying the PID space and interprocess signaling. That is, each process running anywhere on the cluster has a globally unique PID, and all of those processes appear in a ps done on the master. Interprocess signals 'just work' even if the sending and receiving processes are on different nodes in the cluster.

A key benefit of this is that usual sysadmin tasks such as process management 'just work' as if the cluster were a single large machine. Killing a process on the master will really kill it (since that involves a signal) no matter where it is actually running.

To complete the process management functionality, bproc adds the bproc_rfork call. The semantics are EXACTLY the same as fork except that a node number is provided. The child process will exist on the requested node. Thus far, the one exception to the transparency is that the child will find any open FDs closed. A future version of bproc may address that issue, but it IS a hard problem.
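
For illustration, a minimal userspace sketch of bproc_rfork (this assumes the libbproc API with a bproc_rfork(int node) entry point; the header name, node number, and exit code are illustrative assumptions):

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <sys/bproc.h>        /* libbproc userspace API (header name assumed) */

int main(void)
{
    int node = 1;                      /* hypothetical slave node number */
    pid_t pid = bproc_rfork(node);     /* like fork(), but the child lands on 'node' */

    if (pid < 0) {
        perror("bproc_rfork");
        exit(1);
    }
    if (pid == 0) {
        /* Child: now running on the slave node.  Remember that open FDs
         * are closed here, so don't expect stdout to still work. */
        _exit(42);
    }
    /* Parent: waitpid() and kill() work transparently via the ghost process. */
    int status;
    waitpid(pid, &status, 0);
    printf("child on node %d exited with %d\n", node, WEXITSTATUS(status));
    return 0;
}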

For the most part, bproc accomplishes its magic through a combination of inter-node messaging, ghost processes on the master, and masqueraded PIDs on slaves.

Overview

A ghost process is a stub process on the master that represents the real process on a slave node. This means that any system call that might affect that process from another process must include hooks to translate those actions into messages to the node where the process really lives. Meanwhile, if a system call from the real process might affect other processes, it sends a request to its ghost. The ghost stub includes a message queue that receives the request and carries it out on the master.

A masqueraded process is a task_struct on a slave node representing a real process. In addition, the task_struct contains a bproc struct specifying its masquerade PID (the globally unique PID everyone sees). The syscall hooks on a slave node must either convert actions on a masqueraded PID to the real PID if the process is local, or translate them into a message to the master otherwise (the master will then direct it to the correct slave node).

The syscall hooks are set in place through a patch to the Linux kernel. The hooks are contingent on a define set in the kernel configuration.

Each hook supports a default (non)action when the bproc module is not inserted so that a BPROC kernel is perfectly usable in a standalone system as well.

The bproc module supplies functions for the various hooks and a multiplexed syscall to interact with the bpmaster and bpslave userspace daemons. Userspace daemons are used for messaging in part to avoid deadlock-prone in-kernel network messaging.

Until a bpmaster or bpslave daemon uses the bproc syscall to connect into the bproc module, the various hooks do nothing at all in the kernel.

When the daemon connects, a synthetic file is created and passed to the process. That file is used both to send messages into the bproc module to affect processes and for the module to send messages down to the daemon for relay to other nodes in the cluster. Currently, the master and slave daemons use the epoll API to decide when they should read and write to the file.
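
A rough sketch of what that loop looks like is below. The helper names (relay_from_kernel, relay_from_network) are hypothetical stand-ins for the real message pumps, and the real daemons also track write readiness (EPOLLOUT), which is omitted here:

#include <sys/epoll.h>

/* Illustrative daemon I/O loop -- not the real bpmaster/bpslave code.
 * bproc_fd is the synthetic file obtained from the bproc syscall,
 * sock_fd a network connection to a peer daemon. */
static void daemon_loop(int bproc_fd, int sock_fd)
{
    struct epoll_event ev, evs[8];
    int epfd = epoll_create(16);

    ev.events = EPOLLIN; ev.data.fd = bproc_fd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, bproc_fd, &ev);
    ev.events = EPOLLIN; ev.data.fd = sock_fd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, sock_fd, &ev);

    for (;;) {
        int i, n = epoll_wait(epfd, evs, 8, -1);
        for (i = 0; i < n; i++) {
            if (evs[i].data.fd == bproc_fd)
                relay_from_kernel(bproc_fd, sock_fd);   /* module -> network */
            else
                relay_from_network(sock_fd, bproc_fd);  /* network -> module */
        }
    }
}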

Messages

Bproc messages are simply structured data consisting of a fixed header followed by message-specific data. In addition, a message is contained by a request. The request contains bookkeeping information so that the message can be queued appropriately and, when necessary, waited on.

Most messages are actually addressed from and to a process ID. The PIDs used in the addressing are ALWAYS ghost and masq PIDs, never the real PID on a slave node. This ensures that addresses are globally unique.

The message header is defined by:

struct bproc_message_hdr_t {
  uint16_t req;               /* Request type */
  uint8_t  fromtype, totype;
  int32_t  from;                      /* PID or node number */
  int32_t  to;                        /* PID or node number */
  uint32_t size;                      /* Request size */
  /* XXX FIX ME:  platform-independent-ize this */
  void *id;
  long result;
};

req

enum bproc_request_types {
  BPROC_MOVE=1,               /* The basic bproc request */
  BPROC_MOVE_COMPLETE,        /* move finished for vector type moves */
  BPROC_EXEC,
   
  /* Messages from ghost->real process */
  BPROC_FWD_SIG,
  BPROC_GET_STATUS,
   
  /* Remote system calls for real processes */
  BPROC_SYS_FORK,
  BPROC_SYS_KILL,
  BPROC_SYS_WAIT,
  BPROC_SYS_GETSID,
  BPROC_SYS_SETSID,
  BPROC_SYS_GETPGID,
  BPROC_SYS_SETPGID,
   
  /* Real process -> ghost notifications */
  BPROC_STOP,
  BPROC_WAIT,
  BPROC_CONT,
  BPROC_EXIT,
   
  BPROC_PARENT_EXIT,          /* ppid,oppid exited. */
  BPROC_CHILD_ADD,            /* ADD a child to a remote process */
  /*BPROC_CHILD_DEL,*/        /* REMOVE a child from a remote process */
  BPROC_PGRP_CHANGE,          /* somebody changed your pgrp */
  BPROC_PTRACE,               /* Normal ptrace syscall stuff */
  BPROC_REPARENT,             /* Update parent pointer on a process. */
  BPROC_SET_CREDS,
  BPROC_ISORPHANEDPGRP,
   
  /* System control/status */
  BPROC_VERSION,
  BPROC_NODE_CONF,            /* Node configuration message */
  BPROC_NODE_PING,            /* Ping message for my own keepalive... */
  BPROC_NODE_DOWN,            /* Node was set to down */
  BPROC_NODE_EOF,             /* EOF message */
  BPROC_NODE_RECONNECT,       /*  */
   
  BPROC_NODE_CHROOT,          /* Ask a slave daemon to chroot */
   
  BPROC_NODE_REBOOT,          /* Ask node to reboot */
  BPROC_NODE_HALT,            /* Ask node to halt */
  BPROC_NODE_PWROFF,          /* Ask node to power off */
};

A reply to any req will be the same value | 0x8000 (high bit of uint16 set).
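
In code that is just a bit set/test; a sketch (the macro names here are illustrative, not necessarily those in bproc.h):

#define BPROC_RESP_BIT        0x8000
#define BPROC_RESPONSE(req)   ((uint16_t)((req) | BPROC_RESP_BIT))  /* reply type for req */
#define BPROC_ISRESPONSE(req) (((req) & BPROC_RESP_BIT) != 0)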

fromtype

Not used

totype

#define BPROC_ROUTE_REAL  1
#define BPROC_ROUTE_NODE  2
#define BPROC_ROUTE_GHOST 3

The totype assists in determining the purpose of the message.

BPROC_ROUTE_REAL

Nominally routed to a real process. In practice, will be handled by bproc on the real process's behalf.

BPROC_ROUTE_NODE

A global request that will not change the status of a particular process (though PWROFF and such certainly will affect processes!).

BPROC_ROUTE_GHOST

A request from a slave node to the master to update a ghost's status or have the ghost perform a global operation on the slave process's behalf. A slave node should never receive a BPROC_ROUTE_GHOST message.

In kernel

Hooks

The bproc patch adds include/linux/bproc.h containing some clever macros to conditionally call into the bproc module iff it is loaded and the relevant condition is true. The whole header is wrapped in an ifdef CONFIG_BPROC so that if the feature is not selected, the macros are defined to NULL actions (all if_* tests return false, the function calls define to do{}while(0);).
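
Schematically, that header looks something like the sketch below (heavily simplified, with illustrative names; it only shows the CONFIG_BPROC wrapping, not the real argument plumbing):

/* include/linux/bproc.h -- simplified, illustrative sketch */
#ifdef CONFIG_BPROC

struct bproc_hook_ops;                      /* filled in when bproc.ko loads */
extern struct bproc_hook_ops *bproc_hooks;  /* NULL until the module is inserted */

#define bproc_ismasq(tsk)   ((tsk)->bproc.masq != 0)   /* illustrative test */
#define bproc_hook(func, args) \
        do { if (bproc_hooks) bproc_hook_##func##_hook args; } while (0)

#else  /* !CONFIG_BPROC: everything compiles away */

#define bproc_ismasq(tsk)        (0)
#define bproc_hook(func, args)   do { } while (0)

#endif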

For the below: tsk is a task struct; func is the function name, which will call bproc_hook_<func>_hook in the bproc module; args will be passed to the hook function; defl is the value returned when bproc is not installed or the condition of the hook is not met.

For ease of reading, the bproc_hook macro name includes one or more suffixes to specify conditional calls and such. Even 'unconditional' calls only happen if the bproc module is inserted.

bproc_ismasq(tsk)

Returns true if tsk is a real process with a ghost on the master

bproc_isghost(tsk)

Returns true if tsk is a ghost process

bproc_hook(func,args)

Call bproc_hook_func_hook with args unconditionally

bproc_hook_im(func,args)

Call the hook iff the current process is masqueraded (that is, if it has a corresponding ghost process on the master)

bproc_hook_imr(func,args)

Return the return value of the hook iff the current process is masqueraded. This is useful, for example, to replace an entire system call with a remote system call on the master. It expands to:

if(bproc_ismasq(current))
    return bproc_hook_func_hook(args);

bproc_hook_v(defl,func,args)

Call hook. Substitute defl if bproc is not present. Commonly used to 'translate' a masquerade pid to the real pid on a slave.

syscall(pid) {
    pid = bproc_hook_v(pid, masq2real, (pid));
    ...

If bproc is not present, it will be the equivalent of an identity:

    pid = pid;

Otherwise, it will be like:

    pid = bproc_hook_masq2real_hook(pid);

bproc_hook_imv(defl,func,args)

Same as above, but additionally contingent on the current process being masqueraded.

Messaging API

In the bproc module, every message needs a bproc_krequest_t and a bproc_message_hdr_t or one of the more specific bproc_*_msg_t structs (all of which include a bproc_message_hdr_t as their first element).

First, a request is created using bproc_new_req, passing in a type (such as BPROC_FWD_SIG), the size of the message in the request, and the memory allocation mode (usually GFP_KERNEL, but occasionally GFP_ATOMIC is necessary when sleeping can't be allowed).

  req = bproc_new_req(type, sizeof(*msg), GFP_KERNEL);

A void pointer to the message portion of a request can be retrieved with bproc_msg:

  msg = (struct bproc_rsyscall_msg_t *)bproc_msg(req);    

The msg is then filled in appropriately for the particular request. The header is generally filled in by a few helper functions. bpr_to_ghost addresses the message to a ghost process with the specified PID:

  bpr_to_ghost(msg, current->bproc.pid);

bpr_from_real sets the return address from a real process with a given pid (generally the same as the ghost's):

  
  bpr_from_real(msg, current->bproc.pid);

Once filled in, the message is sent with bproc_send_req:

  if ((r = bproc_send_req(&m->req, req)) != 0) return r;

Once sent, we can optionally wait for a response. When that is done, the process sleeps just as it would when blocking on any condition. Like any sleep in the kernel, locks should not be held except in the most exceptional and carefully thought-out circumstances (just don't do it!).

  req->flags |= BPROC_REQ_WANT_RESP;
  r = bproc_response_wait(req, MAX_SCHEDULE_TIMEOUT, interruptible);
  while (r == -EINTR && bproc_pending(req)) {
      /* Deal with a signal */
      masq_forward_signal(m);
      r = bproc_response_wait(req, MAX_SCHEDULE_TIMEOUT, interruptible);
  }

The response message can then be accessed as:

  resp = bproc_msg(req->response);

In all cases, all allocated messages must eventually be freed to avoid leaking memory:

  bproc_put_req(req);
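
Putting those pieces together, a typical request/response exchange from a masqueraded process looks roughly like this. This is a recap sketch built from the snippets above, not code from the bproc source: error paths and the EINTR/signal-forwarding loop are trimmed, the request type and argument filling depend on the particular call, and m is the slave's bproc_masq_master_t pointer as used in the bproc_send_req example:

  struct bproc_krequest_t     *req;
  struct bproc_rsyscall_msg_t *msg;
  struct bproc_null_msg_t     *resp;
  int r;

  req = bproc_new_req(BPROC_SYS_GETSID, sizeof(*msg), GFP_KERNEL);
  if (!req) return -ENOMEM;
  msg = (struct bproc_rsyscall_msg_t *)bproc_msg(req);
  bpr_to_ghost(msg, current->bproc.pid);    /* address: our ghost on the master */
  bpr_from_real(msg, current->bproc.pid);   /* return address: this process */

  req->flags |= BPROC_REQ_WANT_RESP;        /* we intend to wait for the reply */
  r = bproc_send_req(&m->req, req);
  if (r == 0)
      r = bproc_response_wait(req, MAX_SCHEDULE_TIMEOUT, 1);
  if (r == 0) {
      resp = (struct bproc_null_msg_t *)bproc_msg(req->response);
      r = resp->hdr.result;
  }
  bproc_put_req(req);                       /* always free the request */
  return r;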

Adding functionality

As an example, we look at the practical case of adding support for renice to masqueraded processes. Before adding this into bproc-4.1-pre1, doing renice 1 <pid> on the master has no effect on the running process. Doing it on the slave reports no such process.

The former happens because the master sets the new priority/nice value on the ghost process and the real process never knows. On the slave node, since there is no bproc handling in sys_setpriority (or sys_getpriority), it simply doesn't find the masquerade PID at all.

First, an analysis. Since sys_[get|set]priority may operate on a process group, on all processes owned by a user, or on a single process on a different node, a call made on the slave must be handled by the ghost process (using a remote system call), since only the master knows about all of the processes that may be affected.

That takes care of the second case; now for the first. Here we cannot simply call into [get|set]priority on the slave, for several reasons. The simplest is that the master has already checked security on the call and found it to be allowed, so there's little point in re-checking. Next, since the call could affect process groups and group leaders, computing how to divide that work by node is not a pleasant thought.

Tracing through what happens when sys_setpriority is called, we find it eventually calls set_user_nice. Since set_user_nice is a void function, we conclude that it “can't fail”, or at least that if it does, there's nothing useful to do about it. So, the master will handle set_user_nice calls on a ghost process by sending a notification/command to the slave node and NOT expecting a reply.

All of the above implies that we need three new bproc hooks: sys_setpriority, sys_getpriority, and set_user_nice.

First, add appropriate declarations in include/linux/bproc_hooks.h

bprocdeclhook(long,  sys_setpriority, (int which, int who, int niceval));
bprocdeclhook(long,  sys_getpriority, (int which, int who));

bprocdeclhook(void,  set_user_nice,   (struct task_struct *p, long nice));

Note that the declarations are grouped by the kernel source file that uses them (in the comments); be sure to place the new declarations logically!

Now, we patch set_user_nice in kernel/sched.c. Before modification, we have:

void set_user_nice(struct task_struct *p, long nice)
{
      struct prio_array *array;
      int old_prio, delta;
      unsigned long flags;
      struct rq *rq;

      if (TASK_NICE(p) == nice || nice < -20 || nice > 19)
              return;


      /*
       * We have to be careful, if called from sys_setpriority(),
       * the task might be in the middle of scheduling on another CPU.
       */
      rq = task_rq_lock(p, &flags);
      /*
       * The RT priorities are set via sched_setscheduler(), but we still
       * allow the 'normal' nice value to be set - but as expected
       * it wont have any effect on scheduling until the task is
       * not SCHED_NORMAL/SCHED_BATCH:
       */
     ...

Since we WANT the change to affect the ghost as well as the real process (at the very least so the process table contains correct information), we just add the notification inside an if (bproc_isghost) and then allow the rest to run as normal:

void set_user_nice(struct task_struct *p, long nice) {

      struct prio_array *array;
      int old_prio, delta;
      unsigned long flags;
      struct rq *rq;
      if (TASK_NICE(p) == nice || nice < -20 || nice > 19)
              return;
      if(bproc_isghost(p)) {
              bproc_hook(set_user_nice, (p, nice));
      }
      /*
       * We have to be careful, if called from sys_setpriority(),
       * the task might be in the middle of scheduling on another CPU.
       */
      rq = task_rq_lock(p, &flags);
      /*
       * The RT priorities are set via sched_setscheduler(), but we still
       * allow the 'normal' nice value to be set - but as expected
       * it wont have any effect on scheduling until the task is
       * not SCHED_NORMAL/SCHED_BATCH:
       */

Now, we need to set sys_[get|set]priority up to fully replace their code with a remote system call when (and only when) called by a masqueraded process. This allows the master or a standalone system, as well as processes on the slave that do not take part in the system image (like bpslave, various kernel threads, and company), to run normally, and maintains the system image otherwise.

bproc_hook_imr (if masq, return) is perfect for this.

asmlinkage long sys_getpriority(int which, int who) {

      struct task_struct *g, *p;
      struct user_struct *user;
      long niceval, retval = -ESRCH;

      if (which > 2 || which < 0)
              return -EINVAL;

      bproc_hook_imr(sys_getpriority, (which, who));  // do this on the master
      read_lock(&tasklist_lock);
      switch (which) {

asmlinkage long sys_setpriority(int which, int who, int niceval) {

      struct task_struct *g, *p;
      struct user_struct *user;
      int error = -EINVAL;

      if (which > 2 || which < 0)
              goto out;

      /* normalize: avoid signed division (rounding problems) */
      error = -ESRCH;
      if (niceval < -20)
              niceval = -20;
      if (niceval > 19)
              niceval = 19;

      bproc_hook_imr(sys_setpriority, (which, who, niceval)); // do this on the master

      read_lock(&tasklist_lock);
      switch (which) {

While we're at it, we must make sure the symbols for these functions are available to modules since we'll be calling them on the master:

EXPORT_SYMBOL_GPL(sys_setpriority);
EXPORT_SYMBOL_GPL(sys_getpriority);

Note we export for GPL modules only. Some exports leave off the _GPL part, but there is a legal issue there. _GPL is always OK since the whole kernel is GPL, but allowing proprietary modules to use internal symbols is a major exception that should be decided on the Linux kernel mailing list. In any event, bproc is GPL, so it's not an issue here.

That takes care of the kernel, go ahead and compile and install it.

NOTE: Whenever bproc_hooks.h is changed you MUST make clean, then build. Make will not correctly detect and rebuild the dependencies otherwise. The kernel will boot on the master but slaves will mysteriously crash!

The kernel is now ready, but bproc itself has no idea what to do about this (it won't even compile now that we declare hooks it has never heard of). So we switch to the bproc source and cd into kernel/ to get to the module code.

A good first step is to stub the hooks in so we can make sure we get called when we want to be and only when we want to be.

Edit hooks.c and add

long bproc_hook_sys_setpriority(int which, int who, int niceval) {
      // do remote syscall to the ghost, SMJ
      printk("<1>bproc_hook_set_priority not yet implemented\n");
      return -ENOSYS;
}

long bproc_hook_sys_getpriority(int which, int who) {
      // do remote syscall to the ghost, SMJ
      printk("<1>bproc_hook_get_priority not yet implemented\n");
      return -ENOSYS;
}

void bproc_hook_set_user_nice(struct task_struct *p, long nice) {
    printk("<1>bproc: set_user_nice not yet implemented\n");
}

Again, try to place the functions in a logical place, not just append them to the end.

In most cases, this will be fine since the syscalls were failing anyway for masqued processes and the hooks won't affect other processes at all. There may be a few cases where stubbing a function out like this could make the slaves useless though, so consider carefully.

Make and install the bproc module, then reboot the slaves if desired. Exercise the system and look at dmesg to see that your hooks are being called. In this case, renice 2 <some PID on a slave> and bpsh <n> renice 2 <some masqued process> show that the hooks are called as expected.

Now to add the actual functionality to bproc. We'll start with set_user_nice since that alone will show results. Once done, renice on the master will actually renice the process on the slave, but renice run on the slave will still fail. That state of affairs might be good enough for most cases (since there is little reason to run renice on a slave and very few programs call setpriority at all).

First, edit bproc.h to add the request type at the end of bproc_request_types:

enum bproc_request_types {
  BPROC_MOVE=1,               /* The basic bproc request */
  BPROC_MOVE_COMPLETE,        /* move finished for vector type moves */
  BPROC_EXEC,

  ...

  BPROC_NODE_PWROFF,          /* Ask node to power off */

  /* new to 4.2 */
  BPROC_SET_USER_NICE,
};

New requests should go at the end, since otherwise existing enum values change and a minor version mismatch between slave and master will crash horribly rather than mostly work.

Now for a message struct to carry the new request:

struct bproc_usernice_msg_t {
  struct bproc_message_hdr_t hdr;
  long        nice;
};

Note that all *_msg_t structs must start with a bproc_message_hdr_t. Many functions count on that positioning and will crash otherwise. We add a nice field to hold the nice value. A PID is not specified since that information is already available in the hdr's to field (that is, we will send the message to the masqued process).

Now, we'll implement the master's side of things in hooks.c

void bproc_hook_set_user_nice(struct task_struct *p, long nice) {
  struct bproc_krequest_t *req;
  struct bproc_usernice_msg_t *msg;

  req = bproc_new_req(BPROC_SET_USER_NICE, sizeof(*msg), GFP_KERNEL);
  if (!req) return;
  msg = (struct bproc_usernice_msg_t *) bproc_msg(req);
  bpr_to_real(msg, p->pid);
  bpr_from_ghost(msg, p->pid);

  msg->nice = nice;

  bproc_send_req(&bproc_ghost_reqs, req);
  bproc_put_req(req);
//    printk("<1>bproc: set_user_nice not yet implemented\n");
}

That's it for the master. It will dutifully send a notification whenever a ghost process's nice value is changed. Of course, we must now implement the slave's side of things. Otherwise, it will simply ignore the messages and nothing will really happen.

In the slave's module, messages are handled in slave.c in bproc_slave_write. This is the function that is called when the bpslave userspace daemon writes to its special file.

That function is broadly organized into sections that handle responses (particularly responses to remote syscalls, where a process is sleeping on a wait queue and needs to wake up with the response linked to the corresponding request), then messages routed to real processes (BPROC_ROUTE_REAL, which is our case), and finally messages routed to the node itself (BPROC_ROUTE_NODE). Additional cases exist to report errors for BPROC_ROUTE_GHOST messages (since a slave NEVER has ghost processes on it) and unknown routing types. This is reasonably handled by an if/else and nested switch structures, as sketched below.
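
The overall shape, sketched with illustrative names (this is an outline, not the literal bproc_slave_write code; deliver_response is a hypothetical stand-in):

  if (hdr->req & 0x8000) {
          /* A response: find the sleeping request and attach it. */
          deliver_response(m, hdr);
  } else if (hdr->totype == BPROC_ROUTE_REAL) {
          switch (hdr->req) {
          case BPROC_FWD_SIG:        /* ... */ break;
          case BPROC_PTRACE:         /* ... */ break;
          /* our new BPROC_SET_USER_NICE case will go here */
          default:                   /* unknown request: error */ break;
          }
  } else if (hdr->totype == BPROC_ROUTE_NODE) {
          switch (hdr->req) {
          case BPROC_NODE_PING:      /* ... */ break;
          case BPROC_NODE_REBOOT:    /* ... */ break;
          default:                   /* unknown request: error */ break;
          }
  } else {
          /* BPROC_ROUTE_GHOST (a slave never hosts ghosts) or bad totype: error. */
  }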

In our case, we need to add a case to the switch inside the BPROC_ROUTE_REAL case:

           ...
              }
              break;
          case BPROC_SET_USER_NICE:
              ret = masq_user_nice_change(m, (struct bproc_usernice_msg_t *) hdr);
              break;
          case BPROC_PTRACE:
              ptrace_3rd_party(req, m);
           
            ...  

By convention, functions that affect masqed processes have masq_ prepended to their names and are implemented in masq.c. As there are useful helper functions there and conventions are good, we do the same. Now, for the implementation of masq_user_nice_change:

int masq_user_nice_change(struct bproc_masq_master_t *m,
                   struct bproc_usernice_msg_t *msg) {
  struct task_struct *task;
  read_lock(&tasklist_lock);
  task = masq_find_task_by_pid(m, msg->hdr.to);
  if (task) {
      set_user_nice(task, msg->nice);
  }
  read_unlock(&tasklist_lock);
  return task ? 0 : -ESRCH;
} 

When we look at how set_user_nice is called in the kernel proper, we see that sys_setpriority holds the tasklist_lock around it (which also keeps the task we looked up from disappearing), so we make sure to do the same here.

Now, we want to test this new functionality. To do this, we MUST rebuild the entire bproc package, not just the module. The bproc system includes some version magic, and so bpslave, libbpslave.a, bpmaster, and bproc.ko must come from the same build or they will report an error. Further, since Nimbus slaves use beoboot, which links in libbpslave.a rather than a standalone bpslave, that package must also be rebuilt and updated. That done, do beoboot -p (or -e if you're using LinuxBIOS), then bpctl -S allup -R to get the slaves booting. Finally, update the master with /etc/init.d/clustermatic restart so the master will be running the new version by the time the slave comes back up.

Now, to test:

bpsh 1 ps afxl

F   UID   PID  PPID PRI  NI   VSZ  RSS WCHAN  STAT TTY        TIME COMMAND
5     0   318   316  16   0  2320  632 -      R    ?          0:00 ps afxl
1     0 32759     1  18   0  1580  508 -      S    ?          0:00 /usr/bin/mond -p 2709

ps afxl |grep 32759

0     0   321 26975  18   0  3720  640 pipe_w S+   pts/2      0:00          \_ grep 32759
1     0 32759     1  19   0  1580  508 ghost_ Ss   ?          0:00 [mond]

renice 2 32759

bpsh 1 ps afxl

F   UID   PID  PPID PRI  NI   VSZ  RSS WCHAN  STAT TTY        TIME COMMAND
5     0   325   323  16   0  2320  632 -      R    ?          0:00 ps afxl
1     0 32759     1  17   2  1580  508 -      SN   ?          0:00 /usr/bin/mond -p 2709

ps afxl |grep 32759

0     0   327 26975  21   0  3720  640 pipe_w S+   pts/2      0:00          \_ grep 32759
1     0 32759     1  21   2  1580  508 ghost_ SNs  ?          0:00 [mond]

Note the nice value on the slave: SUCCESS!!

Now, to complete the picture, we add the remote syscalls for sys_[get|set]priority.

Ghost processes spend all of their time in a big while loop in ghost_thread (a kernel thread) in ghost.c. That thread dequeues requests posted from the bpmaster daemon.

In turn, requests pass to ghost_handle_request. That processing is quite similar to the slave processing, except that only BPROC_ROUTE_GHOST messages will reach it. In its big switch structure, part-way down, we find cases for each possible remote system call. Ghost processes handle all remote system calls, since only a process makes system calls.

Bproc has infrastructure in place to make simple syscalls simple. bproc_rsyscall_msg_t is pre-defined to carry up to 6 long arguments and a set of signals to block (irrelevant here). bproc_null_response (in spite of its name) is ready-made to send a long result code back to the calling process (it even takes care of addressing the message for us).
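
For orientation, bproc_rsyscall_msg_t has roughly this shape (a sketch from the description above; the exact field names and types may differ, so check bproc.h):

struct bproc_rsyscall_msg_t {
  struct bproc_message_hdr_t hdr;   /* must be first, like every *_msg_t */
  unsigned long arg[6];             /* up to six syscall arguments */
  unsigned long blocked;            /* signals to block while waiting (unused here) */
};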

In our case, we will reply with bproc_null_response. Further, since response messages are linked to requests and the ghost thread handles putting the request, we don't even have to manage memory. We can just do:

  case BPROC_SYS_GETPRIORITY: {
      struct bproc_rsyscall_msg_t *msg = (struct bproc_rsyscall_msg_t *)hdr;
      bproc_null_response(&bproc_ghost_reqs, req,
                          sys_getpriority(msg->arg[0], msg->arg[1]));
      } break;

  case BPROC_SYS_SETPRIORITY: {
      struct bproc_rsyscall_msg_t *msg = (struct bproc_rsyscall_msg_t *)hdr;
      bproc_null_response(&bproc_ghost_reqs, req,
                          sys_setpriority(msg->arg[0], msg->arg[1], msg->arg[2]));
      } break;

and, of course, add BPROC_SYS_GETPRIORITY and BPROC_SYS_SETPRIORITY to bproc_request_types, as shown below.
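
As before, the new values go at the very end of the enum (after the BPROC_SET_USER_NICE we added earlier) so that existing values keep their numbers:

enum bproc_request_types {
  ...
  BPROC_NODE_PWROFF,          /* Ask node to power off */

  /* new to 4.2 */
  BPROC_SET_USER_NICE,
  BPROC_SYS_SETPRIORITY,
  BPROC_SYS_GETPRIORITY,
};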

That takes care of the master, but we now need appropriate hooks to make the remote syscall in the first place. As this is done for masqueraded processes, the hooks go in hooks.c and the implementation goes in masq.c:

int masq_sys_setpriority(int which, int who, int niceval) {
  int result;
  struct bproc_krequest_t     *req;
  struct bproc_rsyscall_msg_t *msg;
  struct bproc_null_msg_t     *resp_msg;

  /* The check for local PIDs (including our own) is done in the
   * hook code. */
  req = bpr_rsyscall1(BPROC_SYS_SETPRIORITY);
  if (!req) return -ENOMEM;
  msg = (struct bproc_rsyscall_msg_t *)bproc_msg(req);
  msg->arg[0] = which;
  msg->arg[1] = who;
  msg->arg[2] = niceval;
  if (bpr_rsyscall2(BPROC_MASQ_MASTER(current), req, 0)) {
      bproc_put_req(req);
      return -EIO;
  }
  resp_msg = bproc_msg(req->response);
  result = resp_msg->hdr.result;
  bproc_put_req(req);
  return result;
}

int masq_sys_getpriority(int which, int who) {
  int result;
  struct bproc_krequest_t     *req;
  struct bproc_rsyscall_msg_t *msg;
  struct bproc_null_msg_t     *resp_msg;

  /* The check for local PIDs (including our own) is done in the
   * hook code. */
  req = bpr_rsyscall1(BPROC_SYS_GETPRIORITY);
  if (!req) return -ENOMEM;
  msg = (struct bproc_rsyscall_msg_t *)bproc_msg(req);
  msg->arg[0] = which;
  msg->arg[1] = who;
  if (bpr_rsyscall2(BPROC_MASQ_MASTER(current), req, 0)) {
      bproc_put_req(req);
      return -EIO;
  }
  resp_msg = bproc_msg(req->response);
  result = resp_msg->hdr.result;
  bproc_put_req(req);
  return result;
}

And to hook them in:

long bproc_hook_sys_setpriority(int which, int who, int niceval) {
      return masq_sys_setpriority(which, who, niceval);
      // do remote syscall to the ghost, SMJ
//      printk("<1>bproc_hook_set_priority not yet implemented\n");
//      return -ENOSYS;
}

long bproc_hook_sys_getpriority(int which, int who) {
      return masq_sys_getpriority(which, who);
      // do remote syscall to the ghost, SMJ
//      printk("<1>bproc_hook_get_priority not yet implemented\n");
//      return -ENOSYS;
}

To be neat about it, add the function declarations to bproc_internals.h:

int   masq_user_nice_change(struct bproc_masq_master_t *m, struct bproc_usernice_msg_t *msg);
int   masq_sys_setpriority(int which, int who, int niceval);
int   masq_sys_getpriority(int which, int who);

A quick final note here. We did NOT alter sys_nice even though it seems relevant. This is simply because it already works just fine! sys_nice always affects the current process, so isn't at all confused by masqueraded PIDs. The master always fetches the status of ghosts from the relevant node's real process, so when nice affects the real process's priority, it shows up on the master just fine.

In userspace

bpmaster and bpslave are userspace daemons running on the master and slaves respectively. They relieve the bproc module of most of the difficulties associated with deadlock-free networking in kernel code. To keep the system responsive, they set themselves to realtime priority and sleep while polling their connections.

bpmaster

Bpmaster is essentially a big connection multiplexor and process mapper. It has a master file connection to the bproc module in the kernel, a listening socket, and accepted connections from every bpslave in the cluster. It accepts new connections and messages from both the kernel and the bpslave processes. It maintains a map of real PIDs and the associated node and socket. An incoming message will update its map of the cluster. Some messages end there, but most must then be relayed either to another node or to the kernel through the master fd.

BPROC_ROUTE_GHOST messages all go to the kernel, while BPROC_ROUTE_REAL messages go to the node where the process resides (which MAY be the kernel on the master, but probably not). The routing decision is sketched below.
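
Schematically (purely illustrative names -- pidmap_lookup, write_to_kernel, write_to_slave, and node_conn are hypothetical; the real bpmaster keeps its map in its own structures):

  switch (hdr->totype) {
  case BPROC_ROUTE_GHOST:
          write_to_kernel(master_fd, hdr);              /* ghosts only exist on the master */
          break;
  case BPROC_ROUTE_REAL: {
          struct pid_entry *e = pidmap_lookup(hdr->to); /* masq PID -> node/connection */
          if (e && e->conn)
                  write_to_slave(e->conn, hdr);         /* relay to the owning slave */
          else
                  write_to_kernel(master_fd, hdr);      /* process is local to the master */
          break;
  }
  case BPROC_ROUTE_NODE:
          write_to_slave(node_conn(hdr->to), hdr);      /* 'to' is a node number here */
          break;
  }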

bpslave

The bpslave is a somewhat stripped-down version. It accepts messages from its master, handles some itself (such as chroot, reboot, etc.), and passes anything else to the kernel. Likewise, some messages from the kernel may be handled locally, but most are sent to the bpmaster. Bpslave never communicates directly with another slave.
