Kool Aid Served Daily
Setting up some backup clone gates
One of my colleagues had a gate wiped on a disk crash (using ZFS, but only 1 disk). That got people wondering about our backup strategy. We have distributed developers, so we've banked on them having private copies. Oh, and we also push directly to OpenSolaris.
So now I'm looking at adding clones in two remote sites - on machines I don't own. And by that, I mean that I can't create a local gatekeeper user account. Which means that I'll have ssh trust issues. I.e., this will not work:
    changegroup.2 = /usr/bin/hg push -R /pool/ws/nfs41-gate ssh://cistern.central//export/ws/th199096/nfs41-clone
I could have a cron job pull changes, but besides being gross, there is a slight chance it will not grab something. No, I want the push to the gate to still copy to the clones.
The answer is to mimic Dave Marker's code for updating OpenSolaris (it is updateoso.py in the onnv-gk-tools repository seen at ssh://anon@hg.opensolaris.org/hg/scm-migration/onnv-gk-tools). Before I do that though, there is the issue that I'll need local copies of that repository as well. I could use the one that is in use for onnv, but then again, if I mess things up, Dave might recall that Oklahoma has a no helmet law and come over to visit. Much safer to create my own copy.
The biggest issue was making sure to add the new key to the two machines. I've got this set up and running.
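For illustration only, here is a minimal sketch of the idea: a small script, run from the gate's changegroup hook, that pushes to each clone using a dedicated ssh key. This is not Dave Marker's updateoso.py. The key path and the function names are hypothetical, the clone URL is the one shown above, and it assumes Python 3 and that `hg push -e` is used to pick the ssh command.

```python
import subprocess

GATE = "/pool/ws/nfs41-gate"
CLONES = ["ssh://cistern.central//export/ws/th199096/nfs41-clone"]
KEY = "/pool/nfs4hg/.ssh/clone_key"  # hypothetical key file added on the remote machines


def push_commands(gate, clones, key):
    """Build one 'hg push' argv per clone, each using the dedicated ssh key."""
    ssh = "ssh -i %s" % key
    return [["/usr/bin/hg", "push", "-R", gate, "-e", ssh, dest]
            for dest in clones]


def push_all(gate, clones, key):
    """Run every push and return the destinations that failed."""
    failures = []
    for argv in push_commands(gate, clones, key):
        rc = subprocess.call(argv)
        # Assumption: exit status 1 means "nothing to push", not a failure.
        if rc not in (0, 1):
            failures.append(argv[-1])
    return failures
```

A hook entry would then call a wrapper that runs push_all(GATE, CLONES, KEY) and logs whatever comes back.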
Originally posted on Kool Aid Served Daily. Copyright (C) 2008, Kool Aid Served Daily
Locking down a gate/repository
So I'm not a mean gatekeeper - I try to not lock down our gate. I feel like I should just be able to ask people not to integrate anything. But the reality is that you can't count on everyone getting the message. So, you need to lock down the gate.
I knew how to do that in Teamware, but I've never had that need with Mercurial. Until today that is - a nasty branch merge with onnv_103.
So to lock down a Mercurial gate with Sun's extensions, you can use lock.py:
    [nfs4hg@aus1500-home ~]> which lock.py
    /pool/nfs4hg/bin/lock.py
    [nfs4hg@aus1500-home nfs41-gate]> lock.py -n -R /pool/ws/nfs41-gate
    [write]: None
Okay, no one has a write lock, so let's grab one:
    [nfs4hg@aus1500-home nfs41-gate]> lock.py -R /pool/ws/nfs41-gate
    [nfs4hg@aus1500-home nfs41-gate]> lock.py -n -R /pool/ws/nfs41-gate
    [write]: nfs4hg th199096
By the way, where is this configured?
    [nfs4hg@aus1500-home ~]> grep lock /pool/ws/nfs41-gate/.hg/hgrc
    # 5. lockdir must be readable by whomever will pull/push
    lockdir = public/lock
    wlock = nfs4hg, th199096
    # These hooks check the lock before anything happens.
    prechangegroup.0 = python:hook.lockchk.lockchk
    # then comment it out, let them push, uncomment, and unlock the gate.
So the lock only works on the gate and not the clone. You can find more about this in the source of lock.py.
Ohh, and even though I haven't tested it, you release the gate easily enough with unlock.py.
Debugging an array
My kernel is dying trying to pull apart an XDR decoded array. And I've got an array of two items, so how do I view the second one?
    [0]> ffffff0010c3a5f0::print DS_REPORTAVAILargs
    {
        ds_id = 0x64
        ds_verifier = 0xffffff02efe3b060
        ds_addrs = {
            ds_addrs_len = 0x1
            ds_addrs_val = 0xffffff0307106368
        }
        ds_attrvers = 1 (DS_ATTR_v1)
        ds_storinfo = {
            ds_storinfo_len = 0x2
            ds_storinfo_val = 0xffffff0304545a80
        }
    }
We can easily see the first one:
    [0]> 0xffffff0304545a80::print ds_storinfo
    {
        type = 1 (ZFS)
        ds_storinfo_u = {
            zfs_info = {
                guid_map = {
                    ds_guid = {
                        stor_type = 1 (ZFS)
                        ds_guid_u = {
                            zfsguid = {
                                zfsguid_len = 0x10
                                zfsguid_val = 0xffffff02ff6de768
                            }
                        }
                    }
                    mds_sid_array = {
                        mds_sid_array_len = 0
                        mds_sid_array_val = 0
                    }
                }
                attrs = {
                    attrs_len = 0x1
                    attrs_val = 0xffffff04c8fd6578
                }
            }
        }
    }
But how do I get to the second one? I.e., how do I tell if it is just a pointer size away or if some special packing is going on?
I can use the array dcmd:
    [0]> 0xffffff0304545a80::array ds_storinfo 2
    ffffff0304545a80
    ffffff0304545ac0
And that shows us:
    [0]> ffffff0304545ac0::print ds_storinfo
    {
        type = 1 (ZFS)
        ds_storinfo_u = {
            zfs_info = {
                guid_map = {
                    ds_guid = {
                        stor_type = 1 (ZFS)
                        ds_guid_u = {
                            zfsguid = {
                                zfsguid_len = 0x10
                                zfsguid_val = 0xffffff0304def2e0
                            }
                        }
                    }
                    mds_sid_array = {
                        mds_sid_array_len = 0
                        mds_sid_array_val = 0
                    }
                }
                attrs = {
                    attrs_len = 0x1
                    attrs_val = 0xffffff04bc8f62a0
                }
            }
        }
    }
Which looks valid and we can quickly test:
    [0]> 0xffffff04bc8f62a0::print ds_zfsattr
    {
        attrname = {
            utf8string_len = 0x4
            utf8string_val = 0xffffff0508e88760
        }
        attrvalue = {
            attrvalue_len = 0x15
            attrvalue_val = 0xffffff050853f270
        }
    }
    [0]> 0xffffff050853f270::dump -w 2
    \/ 1 2 3 4 5 6 7 8 9 a b c d e f 0 1 2 3 4 5 6 7 8 9 a b c d e f v123456789abcdef0123456789abcdef
    ffffff050853f270: 706e6673 2d392d32 343a6461 7461322f 706e6673 32bbddba cefaedfe 98140000 pnfs-9-24:data2/pnfs2...........
So either it is valid or it just happens to point to what I expect!
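The two checks above can be redone by hand. The sketch below (Python 3, with made-up function and variable names) takes the two addresses that ::array printed, confirms the elements are a plain fixed stride apart, and decodes the first attrvalue_len (0x15) bytes of the ::dump output as ASCII. The element size is inferred from the ::array output, not from the header files.

```python
def element_address(base, elem_size, index):
    """Address of element `index` in a densely packed C array."""
    return base + index * elem_size


BASE = 0xffffff0304545a80    # ds_storinfo_val
SECOND = 0xffffff0304545ac0  # second address printed by ::array
ELEM_SIZE = SECOND - BASE    # stride between elements, inferred from ::array

# Bytes copied from the ::dump line, in memory order.
raw = bytes.fromhex(
    "706e6673" "2d392d32" "343a6461" "7461322f" "706e6673" "32"
)
attrvalue = raw[:0x15].decode("ascii")  # attrvalue_len = 0x15

print(hex(ELEM_SIZE), hex(element_address(BASE, ELEM_SIZE, 1)), attrvalue)
```

So the second entry is one structure size past the first, with no special packing, and the attribute value reads as the pool path I expected.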
VMware: Cannot find VMnet0, starting disconnected
So I can't start my virtual machines for VMware:
    Nov 18 09:52:25.157: vcpu-0| [msg.vnet.getLastMessage] VMnet0: The system cannot find the file specified
    Nov 18 09:52:25.157: vcpu-0| [msg.device.startdisconnected] Virtual device Ethernet0 will start disconnected.
But this time I know what has to be going on. Recently I upgraded my 6.5 installation and I turned my machine off while I was out of town. And of course when I came back, the keyboard did not work (error 39?). It turned out that VMware installs a shim on top of the keyboard driver and I think the uninstall hosed me. I followed the directions at code 39 keyboard win XP. Let me tell you, entering text into regedit32 with a dead keyboard is a challenge. But the entry by Bas13 does walk you through it.
I also deleted the network configurations.
So I searched for the error messages today and mostly turned up Linux related questions.
I decided to reinstall the latest VMware upgrade and when I was prompted, I selected Repair as an option. That fixed the issue for me.
Being a family IT support guy
My mother in law's computer is heavily infected. Why, why not?
I've been using BleepingComputer to figure out how to clean it up.
As it wouldn't boot as far as I was concerned, I took out the hard drive and put it in a USB enclosure. I then attached it to a laptop I was willing to reformat if necessary. I then ran a virus scanner and Spybot - Search and Destroy on it. When I thought it was clean enough, I got the PC to boot again.
And now I'm going through online tools to scan it again and again. I'll get each tool to report nothing and start a new tool up. Right now I'm working down the list on Preparation Guide For Use Before Posting A Hijackthis Log, Instructions for receiving help in cleaning your computer and I'm doing Ad-Aware 2008 Free. And it is half done with 528 infections found. ;<.
So it finally finishes all of the way. A hint is to not start a browser up before running Ad-Aware. It will find offenses in that case.
The Business of Blogging
Sometimes an inflight magazine actually has something of real interest to me, but I don't want to take the paper copy with me. I just want the one article. On a recent United Airlines flight, I came across The Business of Blogging by Ethan G. Salwen. I've written about what I like to blog about and how to drive up hits (simple - write articles), but I was struck by two of Ethan's rules:
- Write 2 - 3 times a week.
- Provide 3 external links.
The first really restates my write articles rule.
The second deals with how external search engines will rank your blog. If you are linking out, your entry is taken to be more interesting.
PANDORA
I bought PANDORA as an app for my iPhone. You enter a song/group/composer that you like and it presents you with music that matches the style.
I'd like a filter on it such that I could find age appropriate music for my son.
Said Syed's OKCOSUG talk
Just got back from my first Oklahoma City OpenSolaris User Group meeting. It was fun. Said gave a presentation on sizing provisioning for attaching storage to VMware's ESX. It provided a good overview on VMotion and SVMotion.
But of more interest to me was Said's interest in serving the customer. Not only did he try to shape the presentation to those in the audience, he wanted to learn how to convey information better in his blog. I hope I was able to help.
One thing that I learned about customer service and blogging from him was that he flat out told the audience, if you have a question, post a comment in one of my blog entries and I'll get back to you. I.e., he is flipping the push model of the author delivering in blogging to a pull model of the reader driving content. I was floored by this and came away wondering how to have a free floating request section such that comment fields stay true to the blog article and new content can be driven.
Also of interest was the audience member who basically asked when Sun's xVM was going to support NFSv4.1. We don't even have it close to shipping and already people want it in configurations we aren't thinking about!
Just saw an interesting integration float by
I'm being a complete OpenStorage fanboy: I just saw the integration notice for 6760398 Moving NDMP to open source go by and, without even knowing exactly what it entailed, I was happy.
So happy I downloaded the Mercurial source onto my Linux server to make sure I could already see the change:
    [thud@adept onnv-gate]> hg history | more
    changeset: 7917:5c4442486198
    tag: tip
    user: Reza Sabdar <Reza (dot) Sabdar (at) Sun (dot) COM>
    date: Thu Oct 23 11:42:48 2008 -0700
    summary: 6760398 Moving NDMP to open source
Another big step for OpenStorage:
    Author: Reza Sabdar <Reza (dot) Sabdar (at) Sun (dot) COM>
    Repository: /export/onnv-gate
    Total changesets: 1
    Changeset: 5c4442486198
    Comments: 6760398 Moving NDMP to open source
    Files:
    added:
        usr/src/cmd/ndmpadm/Makefile
        usr/src/cmd/ndmpadm/ndmpadm.h
        usr/src/cmd/ndmpadm/ndmpadm_main.c
        usr/src/cmd/ndmpadm/ndmpadm_print.c
        usr/src/cmd/ndmpd/LICENSE
        usr/src/cmd/ndmpd/LICENSE.descrip
        usr/src/cmd/ndmpd/Makefile
        usr/src/cmd/ndmpd/include/bitmap.h
        usr/src/cmd/ndmpd/include/cstack.h
        usr/src/cmd/ndmpd/include/ndmpd_door.h
        usr/src/cmd/ndmpd/include/ndmpd_prop.h
        usr/src/cmd/ndmpd/include/tlm.h
        usr/src/cmd/ndmpd/include/tlm_buffers.h
        usr/src/cmd/ndmpd/include/traverse.h
        usr/src/cmd/ndmpd/ndmp.xml
        usr/src/cmd/ndmpd/ndmp/Makefile.rpcgen
        usr/src/cmd/ndmpd/ndmp/ndmp.x
        usr/src/cmd/ndmpd/ndmp/ndmpd.h
        usr/src/cmd/ndmpd/ndmp/ndmpd_callbacks.c
        usr/src/cmd/ndmpd/ndmp/ndmpd_chkpnt.c
        usr/src/cmd/ndmpd/ndmp/ndmpd_comm.c
        usr/src/cmd/ndmpd/ndmp/ndmpd_common.h
        usr/src/cmd/ndmpd/ndmp/ndmpd_config.c
        usr/src/cmd/ndmpd/ndmp/ndmpd_connect.c
        usr/src/cmd/ndmpd/ndmp/ndmpd_data.c
        usr/src/cmd/ndmpd/ndmp/ndmpd_door.c
        usr/src/cmd/ndmpd/ndmp/ndmpd_dtime.c
        usr/src/cmd/ndmpd/ndmp/ndmpd_fhistory.c
        usr/src/cmd/ndmpd/ndmp/ndmpd_handler.c
        usr/src/cmd/ndmpd/ndmp/ndmpd_log.c
        usr/src/cmd/ndmpd/ndmp/ndmpd_log.h
        usr/src/cmd/ndmpd/ndmp/ndmpd_main.c
        usr/src/cmd/ndmpd/ndmp/ndmpd_mark.c
        usr/src/cmd/ndmpd/ndmp/ndmpd_mover.c
        usr/src/cmd/ndmpd/ndmp/ndmpd_prop.c
        usr/src/cmd/ndmpd/ndmp/ndmpd_scsi.c
        usr/src/cmd/ndmpd/ndmp/ndmpd_tape.c
        usr/src/cmd/ndmpd/ndmp/ndmpd_tar.c
        usr/src/cmd/ndmpd/ndmp/ndmpd_tar3.c
        usr/src/cmd/ndmpd/ndmp/ndmpd_util.c
        usr/src/cmd/ndmpd/svc-ndmp
        usr/src/cmd/ndmpd/tlm/tlm_backup_reader.c
        usr/src/cmd/ndmpd/tlm/tlm_bitmap.c
        usr/src/cmd/ndmpd/tlm/tlm_buffers.c
        usr/src/cmd/ndmpd/tlm/tlm_hardlink.c
        usr/src/cmd/ndmpd/tlm/tlm_info.c
        usr/src/cmd/ndmpd/tlm/tlm_init.c
        usr/src/cmd/ndmpd/tlm/tlm_lib.c
        usr/src/cmd/ndmpd/tlm/tlm_proto.h
        usr/src/cmd/ndmpd/tlm/tlm_restore_writer.c
        usr/src/cmd/ndmpd/tlm/tlm_traverse.c
        usr/src/cmd/ndmpd/tlm/tlm_util.c
        usr/src/cmd/ndmpstat/Makefile
        usr/src/cmd/ndmpstat/ndmpstat_main.c
        usr/src/lib/libndmp/Makefile
        usr/src/lib/libndmp/Makefile.com
        usr/src/lib/libndmp/amd64/Makefile
        usr/src/lib/libndmp/common/libndmp.c
        usr/src/lib/libndmp/common/libndmp.h
        usr/src/lib/libndmp/common/libndmp_base64.c
        usr/src/lib/libndmp/common/libndmp_door_data.c
        usr/src/lib/libndmp/common/libndmp_error.c
        usr/src/lib/libndmp/common/libndmp_prop.c
        usr/src/lib/libndmp/common/llib-lndmp
        usr/src/lib/libndmp/common/mapfile-vers
        usr/src/lib/libndmp/i386/Makefile
        usr/src/lib/libndmp/sparc/Makefile
        usr/src/lib/libndmp/sparcv9/Makefile
    modified:
        usr/src/Makefile.lint
        usr/src/cmd/Makefile
        usr/src/lib/Makefile
        usr/src/pkgdefs/SUNWndmpu/Makefile
        usr/src/tools/opensolaris/license-list
snoop doesn't want to decode the DS_EXIBIargs
I'm getting a signal when I use my snoop on a ds to mds packet trace:
    [th199096@jhereg snoop]> ./snoop -V -i ~/ds2tmds.snoop > xxx
    WARNING: received signal 11 from packet 4
And packet 4 is:
    4 0.00596 pnfs-9-25 -> pnfs-9-26.Central.Sun.COM CTL-DS C DS_EXIBI
I've gone from handcoding the XDR to generating it automatically, and both had this error. Time to see what is going on. I generated the packet trace with '-x0,2000' so I get to see the output:
      0: 001b 242d e629 001b 242d e641 0800 4500 ..$-.)..$-.A..E.
     16: 00a8 d3dd 4000 4006 0000 0a01 e943 0a01 ....@.@......C..
     32: e944 03fc 0801 1f9e 8f80 17cc 1d01 5018 .D............P.
     48: c1e8 0000 0000 8000 007c d058 4d4c 0000 .........|.XML..
     64: 0000 0000 0002 0001 9641 0000 0001 0000 .........A......
     80: 0002 0000 0001 0000 0020 48f4 fa0b 0000 ......... H.....
     96: 0009 706e 6673 2d39 2d32 3500 0000 0000 ..pnfs-9-25.....
    112: 0000 0000 0000 0000 0000 0000 0000 0000 ................
    128: 0000 ffff ff02 eef0 3b00 0000 0027 706e ........;....'pn
    144: 6673 2d39 2d32 353a 2073 7663 3a2f 6e65 fs-9-25: svc:/ne
    160: 7477 6f72 6b2f 6473 6572 763a 6465 6661 twork/dserv:defa
    176: 756c 743a 0000 ult:..
I'm going to go back and forth in the code to look at this. First, I've tracked down where the FMRI is appearing:
    case NFS4_SETPORT:
        uaddr = get_uaddr(nconf, addr);
        if (uaddr == NULL) {
            dserv_log(do_all_handle, LOG_INFO,
                gettext("NFS4_SETPORT: get_uaddr failed"));
            return (1);
        }
        (void) strlcpy(setportargs.dsa_uaddr, uaddr,
            sizeof (setportargs.dsa_uaddr));
        (void) strlcpy(setportargs.dsa_proto, nconf->nc_proto,
            sizeof (setportargs.dsa_proto));
        (void) strlcpy(setportargs.dsa_name, getenv("SMF_FMRI"),
            sizeof (setportargs.dsa_name));
This is in usr/src/cmd/dserv/dservd/tbind_sup.c and is a dservd call into the kernel. It ends up in usr/src/uts/common/dserv/dserv_mds.c (minus some unpacking):
    int
    dserv_mds_addport(const char *uaddr, const char *proto, const char *aname)
    {
        ...
        (void) sprintf(in, "%s: %s:", uts_nodename(), aname);
        inst->dmi_name = dserv_strdup(in);
        bzero(&res, sizeof (res));
        args.ds_ident.boot_verifier = inst->dmi_verifier;
        args.ds_ident.instance.instance_len = strlen(inst->dmi_name) + 1;
        args.ds_ident.instance.instance_val = inst->dmi_name;
The defaults are also set:
    dserv_mds_instance_init(dserv_mds_instance_t *inst)
    {
        inst->dmi_ds_id = 0;
        inst->dmi_mds_addr = NULL;
        inst->dmi_mds_netid = NULL;
        inst->dmi_verifier = (uintptr_t)curthread;
        inst->dmi_teardown_in_progress = B_FALSE;
    }
So, if we knew the curthread, we could spot check to see that this went across okay. We also need to know if this has to be unique or not. If so, could we get a dup here?
So how does this data go across the wire? We need to look in the XDR (usr/src/head/rpcsvc/ds_prot.x):
    struct identity {
        ds_verifier boot_verifier;
        opaque instance<MAXPATHLEN>;
    };
    /*
     * DS_EXIBI - Exchange Identity and Boot Instance
     *
     * ds_ident : An identifier that the MDS can use to distinguish
     *            between data-server instances.
     */
    struct DS_EXIBIargs {
        identity ds_ident;
    };
So we see the boot_verifier followed by the instance. BTW: MAXPATHLEN might be too small here as we add the nodename.
And an opaque is a length and an array. Hmm, the hand-coded usr/src/cmd/cmd-inet/usr.sbin/snoop/nfs4_xdr.c calls xdr_opaque, while the machine generated code does:
    bool_t
    xdr_identity(XDR *xdrs, identity *objp)
    {
        rpc_inline_t *buf;

        if (!xdr_ds_verifier(xdrs, &objp->boot_verifier))
            return (FALSE);
        if (!xdr_bytes(xdrs, (char **)&objp->instance.instance_val,
            (u_int *) &objp->instance.instance_len, MAXPATHLEN))
            return (FALSE);
        return (TRUE);
    }
And that makes a difference:
    [th199096@jhereg snoop]> ./snoop -v -i ~/ds2tmds.snoop > xxx
    [th199096@jhereg snoop]>
But wait, we don't see the signal, but we do see:
    CTL-DS: ----- Sun CTL-DS -----
    CTL-DS:
    CTL-DS: Proc = 2 (Exchange Identity and Boot Instance)
    CTL-DS: ---- short frame ---
And a debug statement shows that the length looks off:
    CTL-DS: ----- Sun CTL-DS -----
    CTL-DS:
    CTL-DS: Proc = 2 (Exchange Identity and Boot Instance)
    CTL-DS: xdr_identity bombed, len = 0
    CTL-DS: ---- short frame ---
Hmm, I manually set the length before the call to xdr_opaque. So back to the raw data. We know right before the nodename, we should find the length.
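As a sanity check on the raw data, here is a small Python 3 sketch that decodes the identity by hand from the bytes starting at offset 130 of the hex dump above. It assumes ds_verifier goes over the wire as a 64-bit big-endian value (a guess based on the dump and on dmi_verifier being a uintptr_t), followed by the usual XDR variable-length opaque: a 4-byte length, the bytes, and padding to a 4-byte boundary. The function name is made up.

```python
import struct

# Bytes copied from the hex dump above, starting at offset 130.
PAYLOAD = bytes.fromhex(
    "ffffff02eef03b00"  # boot_verifier (assumed 64-bit)
    "00000027"          # opaque length
    "706e6673" "2d392d32" "353a2073" "76633a2f" "6e657477"
    "6f726b2f" "64736572" "763a6465" "6661756c" "743a0000"
)


def decode_identity(buf, offset=0):
    """Return (verifier, instance bytes, offset just past the padded opaque)."""
    verifier, length = struct.unpack_from(">QI", buf, offset)
    start = offset + 12
    data = buf[start:start + length]
    if len(data) != length:
        raise ValueError("short frame")
    padded = (length + 3) & ~3  # XDR pads opaque data to a multiple of 4
    return verifier, data, start + padded


verifier, instance, end = decode_identity(PAYLOAD)
print(hex(verifier), len(instance), instance)
```

The length on the wire is 0x27, which is strlen of the "nodename: FMRI:" string plus 1 for the NUL, so the packet itself looks fine and the problem has to be on the decode side.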
Hmm, my allergies are killing my thought process. A signal 11 is SIGSEGV.
I'm back after a night's rest. I recompiled snoop with gcc and I think I've found the problem after staring at it in gdb:
    127 switch (xdrs->x_op) {
    128 case XDR_DECODE:
    129     if (nodesize == 0)
    130         return (TRUE);
    131     if (sp == NULL)
    (gdb)
    132         *cpp = sp = (char *)mem_alloc(nodesize);
    133     /* FALLTHROUGH */
    134
    135 case XDR_ENCODE:
    136     sprintf(get_line(0, 0), "tdh_xdr_bytes calling xdr_opaque with %d!", nodesize);
    137     return (xdr_opaque(xdrs, sp, nodesize));
    138
    139 case XDR_FREE:
    140     if (sp != NULL) {
    141         mem_free(sp, nodesize);
    (gdb) p sp
    $9 = 0x80c74a6 "\203�\020\203}\020"
    (gdb)
We need to be allocating memory here. But whatever sp is pointing to is junk:
    bool_t
    xdr_identity(XDR *xdrs, identity *objp)
    {
        rpc_inline_t *buf;

        if (!xdr_ds_verifier(xdrs, &objp->boot_verifier)) {
            sprintf(get_line(0, 0), "xdr_identity bombed for verifier = %d",
                objp->boot_verifier);
            return (FALSE);
        }
        sprintf(get_line(0, 0), "xdr_identity okay for verifier = %lx",
            objp->boot_verifier);
        if (!tdh_xdr_bytes(xdrs, (char **)&objp->instance.instance_val,
            (u_int *) &objp->instance.instance_len, MAXPATHLEN)) {
            sprintf(get_line(0, 0), "xdr_identity bombed, len = %d",
                objp->instance.instance_len);
            return (FALSE);
        }
        return (TRUE);
    }
And we can see I am just grabbing it off the stack:
    static void
    ds_exibi_sa(char *line)
    {
        DS_EXIBIargs eargs;

        if (!xdr_DS_EXIBIargs(&xdrm, &eargs))
            longjmp(xdr_err, 1);
        sprintf(line, "V = %d I = (%.20s)", eargs.ds_ident.boot_verifier,
            utf8localize((utf8string *)&eargs.ds_ident.instance));
        xdr_free(xdr_DS_EXIBIargs, (char *)&eargs);
    }
A quick memset and retest:
    [th199096@jhereg snoop]> ./snoop -v -i ~/ds2mds2.snoop > zzz
    WARNING: received signal 11 from packet 4
    [th199096@jhereg snoop]> ./snoop -v -i ~/ds2mds2.snoop > zzz
    [th199096@jhereg snoop]>
And we can see the difference!
Another common task for Python
I'm in the midst of debugging a snoop implementation and I wanted to recompile with gcc and use gdb. I saved the output from the make command and basically used vi to put each .o file on a single line:
    [th199096@jhereg snoop]> more files.make
    nfs4_xdr.o
    snoop.o
    snoop_aarp.o
    snoop_adsp.o
    snoop_aecho.o
    snoop_apple.o
Note that I could strip off the '.o's manually, but typically I would leave them there. What I want to do is take the filename and use it twice in a command. I.e.
    % gcc -g -c -o nfs4_xdr.o nfs4_xdr.c
So I decided to use Python to learn a bit more about it:
    [th199096@jhereg snoop]> more tran.py
    #!/usr/sfw/bin/python
    l1 = []
    print "#!/bin/sh -x"
    print "# Make no changes here, machine generated!"
    print "rm snoop *.o"
    for line in open("files.make"):
        [name, ext] = line.split('.')
        print "gcc -g -DUSE_FOR_SNOOP -c -I/builds/th199096/snoop/proto/root_i386/usr/include" \
            " -I. -I/builds/th199096/snoop/usr/src/common/net/dhcp -o %s.o %s.c" % (name, name)
        l1.append(name)
    print 'gcc -g -DTEXT_DOMAIN="SUNW_OST_OSCMD" -D_TS_ERRN -Bdirect -o snoop ',
    for name in l1:
        print "%s.o" % (name),
    print "-L/builds/th199096/snoop/proto/root_i386/lib -L/builds/th199096/snoop/proto/root_i386/usr/lib" \
        " -ldhcputil -ldlpi -lsocket -lnsl -ltsol"
One thing that jumped out was that since I threw away the 'ext', I didn't have to worry about stripping off the '\n'. I also made use of the ',' on the end of the print statements to keep a line going.
I liked the ease of adding to the list of filenames. And in general, I found it easy to make a quick change and retest.
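For comparison, here is a rough Python 3 sketch of the same generator. tran.py above is Python 2, where print is a statement and a trailing ',' suppresses the newline. The function names and the trimmed-down flag handling are mine, not part of the original script. It builds the script as a list of lines and uses os.path.splitext rather than split('.'), so a filename with an extra dot would not break the unpacking.

```python
import os


def compile_command(name, include_dirs=()):
    """One gcc compile line for name.c -> name.o."""
    parts = ["gcc", "-g", "-c"]
    parts += ["-I" + d for d in include_dirs]
    parts += ["-o", name + ".o", name + ".c"]
    return " ".join(parts)


def build_script(object_files, include_dirs=(), link_flags=()):
    """Return the text of a shell script that recompiles and relinks snoop."""
    names = [os.path.splitext(line.strip())[0]
             for line in object_files if line.strip()]
    lines = ["#!/bin/sh -x",
             "# Make no changes here, machine generated!",
             "rm snoop *.o"]
    lines += [compile_command(n, include_dirs) for n in names]
    link = ["gcc", "-g", "-o", "snoop"]
    link += [n + ".o" for n in names]
    link += list(link_flags)
    lines.append(" ".join(link))
    return "\n".join(lines)
```

Something like print(build_script(open("files.make"), link_flags=["-lsocket", "-lnsl"])) would then emit the script, since iterating over an open file yields one line per object file.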
Could I have done this another way, say with the Makefile? Sure, but it wouldn't have been a learning experience. And off I go, the gdb prompt is calling me!
Another Linux Dynamic Pseudo Root patch submitted by Steve Dickson of Redhat
So another patch set for a Linux Dynamic Pseudo Root has been submitted by Steve Dickson to the Linux NFSv4 mailing list:
    The following patch series gives rpc.mountd the ability to allocate a dynamic pseudo root, so the 'fsid=0' export option is no longer required. This allows v2, v3 and v4 clients mounts without any changes to the server's exports list. One anomaly of the Linux NFS server is that it requires a pseudo root to be defined. Currently the only way a pseudo root can be defined is by setting the fsid to zero (i.e. fsid=0). So if we wanted to make v4 the default mounting version and have things just work like v2/v3 all of the existing exports configurations would have to change (i.e. a 'fsid=0' would have to be added) to support a v4 mounts, which, imho, is unacceptable. So this patch series address this problem.
Steve has really highlighted a huge gap in seamless integration of the Linux NFSv4 implementation into automounters, etc. The path to an export should not change based on the version of the protocol.
Hmm, strike that. On re-reading, I'm not sure whether this patch eliminates my concern about the mount path or not. I.e., above he talks about adding a 'fsid=0' on the server and not what the client has to do about the path.
Time to ask him!
Update: Steve says it does address the mount path issue we've seen in the past.
Barebones framework in place for snoop!
I just got a framework in place for a table driven approach for snoop to decode the Control Protocol used between our DS and MDS servers for pNFS.
Here you can see the DS talking to the MDS and the MDS sending a NULL query:
    [th199096@pnfs-9-25 ~]> sudo snoop.ctl -i /root/ds2tmds.snoop | grep CTL
    4 0.00596 pnfs-9-25 -> pnfs-9-26.Central.Sun.COM CTL-DS C DS_EXIBI
    6 0.00000 pnfs-9-26.Central.Sun.COM -> pnfs-9-25 CTL-DS R DS_EXIBI
    8 0.00009 pnfs-9-25 -> pnfs-9-26.Central.Sun.COM CTL-DS C DS_REPORTAVAIL
    13 0.00000 pnfs-9-26.Central.Sun.COM -> pnfs-9-25 CTL-MDS C MDS_NULL
    15 2.99097 pnfs-9-25 -> pnfs-9-26.Central.Sun.COM CTL-MDS R MDS_NULL
    17 0.00012 pnfs-9-26.Central.Sun.COM -> pnfs-9-25 CTL-DS R DS_REPORTAVAIL
Guess I'll have to fill in the guts of this to see what is really being said here!
Trying to examine snoop output for DS to MDS communication
I decided to do a massive snoop session as I brought 2 DSes online with 2 storage pools each. I wanted to see the transactions go across the wire. Then I found out that snoop doesn't know about our control protocol. Guess what I'm working on now?
Anyway, I have a special kernel that just sends along the new ds_path information - no spe. I decided I could at least look in the kernel and see what I had:
    > ::walk mds_DS_guid_entry_cache | ::print struct rfs4_dbe data | ::print -a ds_guid_info_t
    {
        ffffff0315aebc18 dbe = 0xffffff0315aebba8
        ffffff0315aebc20 ds_ownerp = 0xffffff0315aede48
        ffffff0315aebc28 ds_guid_next = {
            ffffff0315aebc28 list_next = 0
            ffffff0315aebc30 list_prev = 0
        }
        ffffff0315aebc38 ds_guid = {
            ffffff0315aebc38 stor_type = 1 (ZFS)
            ffffff0315aebc40 ds_guid_u = {
                ffffff0315aebc40 zfsguid = {
                    ffffff0315aebc40 zfsguid_len = 0x10
                    ffffff0315aebc48 zfsguid_val = 0xffffff02f0d4aca8
                }
            }
        }
        ffffff0315aebc50 ds_attr_len = 0
        ffffff0315aebc58 ds_attr_val = 0
        ffffff0315aebc60 ds_path = {
            ffffff0315aebc60 utf8string_len = 0
            ffffff0315aebc68 utf8string_val = 0
        }
    }
    {
        ffffff0315aebd10 dbe = 0xffffff0315aebca0
        ffffff0315aebd18 ds_ownerp = 0xffffff0315aede48
        ffffff0315aebd20 ds_guid_next = {
            ffffff0315aebd20 list_next = 0
            ffffff0315aebd28 list_prev = 0
        }
        ffffff0315aebd30 ds_guid = {
            ffffff0315aebd30 stor_type = 1 (ZFS)
            ffffff0315aebd38 ds_guid_u = {
                ffffff0315aebd38 zfsguid = {
                    ffffff0315aebd38 zfsguid_len = 0x10
                    ffffff0315aebd40 zfsguid_val = 0xffffff02f0c9f1c0
                }
            }
        }
        ffffff0315aebd48 ds_attr_len = 0
        ffffff0315aebd50 ds_attr_val = 0
        ffffff0315aebd58 ds_path = {
            ffffff0315aebd58 utf8string_len = 0
            ffffff0315aebd60 utf8string_val = 0
        }
    }
    {
        ffffff0315aebe08 dbe = 0xffffff0315aebd98
        ffffff0315aebe10 ds_ownerp = 0xffffff0315aedf58
        ffffff0315aebe18 ds_guid_next = {
            ffffff0315aebe18 list_next = 0
            ffffff0315aebe20 list_prev = 0
        }
        ffffff0315aebe28 ds_guid = {
            ffffff0315aebe28 stor_type = 1 (ZFS)
            ffffff0315aebe30 ds_guid_u = {
                ffffff0315aebe30 zfsguid = {
                    ffffff0315aebe30 zfsguid_len = 0x10
                    ffffff0315aebe38 zfsguid_val = 0xffffff02f0c9f0a8
                }
            }
        }
        ffffff0315aebe40 ds_attr_len = 0
        ffffff0315aebe48 ds_attr_val = 0
        ffffff0315aebe50 ds_path = {
            ffffff0315aebe50 utf8string_len = 0
            ffffff0315aebe58 utf8string_val = 0
        }
    }
    {
        ffffff0315aebf00 dbe = 0xffffff0315aebe90
        ffffff0315aebf08 ds_ownerp = 0xffffff0315aedf58
        ffffff0315aebf10 ds_guid_next = {
            ffffff0315aebf10 list_next = 0
            ffffff0315aebf18 list_prev = 0
        }
        ffffff0315aebf20 ds_guid = {
            ffffff0315aebf20 stor_type = 1 (ZFS)
            ffffff0315aebf28 ds_guid_u = {
                ffffff0315aebf28 zfsguid = {
                    ffffff0315aebf28 zfsguid_len = 0x10
                    ffffff0315aebf30 zfsguid_val = 0xffffff02f398d790
                }
            }
        }
        ffffff0315aebf38 ds_attr_len = 0
        ffffff0315aebf40 ds_attr_val = 0
        ffffff0315aebf48 ds_path = {
            ffffff0315aebf48 utf8string_len = 0
            ffffff0315aebf50 utf8string_val = 0
        }
    }
    >
And no path strings. At least I have 4 entries, which I was expecting!
Data structures for DS_REPORTAVAIL
A DS communicates the data server storage associated with it to the MDS. (We'll look at this in more depth later.) It does that via an RPC call -- DS_REPORTAVAIL. Here are the associated structures used for the XDR:
Notice that ds_path has been added for the SPE to allow for human readable names to be mapped to guids.
I've been meaning to learn more about RBAC
If you also have been meaning to learn more about RBAC, a good start would be: Introducing pfexec, a Convenient Utility in the OpenSolaris OS By Joerg Moellenkamp, with contributions from Marina Sum, October 13, 2008.
ds_addr_t is now ds_addrlist_t
As a group, we decided to change ds_addr_t to ds_addrlist_t to avoid confusion with struct ds_addr. The OpenSolaris gate has those changes already.
Restarting with mds_gather_devs
Time to pick back up on that analysis, but remembering that ds_addr is different than ds_addr_t.
mds_gather_devs
Note, we are in usr/src/uts/common/fs/nfs/nfs41_state.c...
So mds_gather_devs does the work of stuffing the layout. It gets called for every entry found in the instp->ds_addr_tab:
    ds_addr_t *dp = (ds_addr_t *)entry;
    ...
    if (gap->dex < gap->max_devs_needed) {
        gap->lo_arg.lo_devs[gap->dex] = rfs4_dbe_getid(dp->dbe);
        gap->dev_ptr[gap->dex] = dp;
        gap->dex++;
    }
So we keep on reading ds_addr_t data structures until we have enough.
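The callback pattern is easier to see outside the kernel. Below is a toy Python 3 model of it, not the OpenSolaris code: the table is a plain list, each entry is a dict with a made-up "id" key standing in for rfs4_dbe_getid(dp->dbe), and the walk visits every entry while the callback simply stops recording once it has max_devs_needed of them.

```python
def gather_devs(table, max_devs_needed):
    """Collect ids and entries for at most max_devs_needed table entries."""
    dev_ids = []   # plays the role of gap->lo_arg.lo_devs
    dev_ptrs = []  # plays the role of gap->dev_ptr

    def visit(entry):
        # Same test as the kernel callback: only record while we need more.
        if len(dev_ids) < max_devs_needed:
            dev_ids.append(entry["id"])
            dev_ptrs.append(entry)

    # The walker calls the callback for every entry; it does not stop early.
    for entry in table:
        visit(entry)
    return dev_ids, dev_ptrs
```

Note that which devices end up in the layout depends purely on walk order, which matches the observation that nothing here ties a particular guid to the device list.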
Now, how is that table populated? We are looping over these entries in the NFSv4 state tables:
    rw_enter(&instp->ds_addr_lock, RW_READER);
    rfs4_dbe_walk(instp->ds_addr_tab, mds_gather_devs, &args);
    rw_exit(&instp->ds_addr_lock);
So we need to look for instp->ds_addr_tab or instp->ds_addr_idx. And in usr/src/uts/common/fs/nfs/ds_srv.c, we find mds_ds_addr_update which does:
    ds_status
    mds_ds_addr_update(ds_owner_t *dop, struct ds_addr *dap)
    {
        struct mds_adddev_args darg;
        bool_t create = FALSE;
        ds_addr_t *devp;
        ...
        if ((devp = (ds_addr_t *)rfs4_dbsearch(mds_server->ds_addr_uaddr_idx,
            (void *)dap->addr.na_r_addr,
            &create, NULL, RFS4_DBS_VALID)) != NULL) {
            MDS_SET_DS_FLAGS(devp->dev_flags, dap->validuse);
            rw_exit(&mds_server->ds_addr_lock);
            return (stat);
        }
Note how we are calling the ds_addr_t a devp; perhaps a better structure name might be ds_dev_addr_t.
So, if we find one in mds_server->ds_addr_tab (via the mds_server->ds_addr_uaddr_idx which is a secondary index to ds_addr_idx), then we return. Else:
    darg.dev_netid = kstrdup(dap->addr.na_r_netid);
    darg.dev_addr = kstrdup(dap->addr.na_r_addr);

    /* make it */
    devp = (ds_addr_t *)rfs4_dbcreate(mds_server->ds_addr_idx,
        (void *)&darg);

    if (devp) {
        devp->ds_owner = dop;
        MDS_SET_DS_FLAGS(devp->dev_flags, dap->validuse);
        list_insert_tail(&dop->ds_addr_list, devp);
    } else
        stat = DSERR_INVAL;
we grab the info out of the ds_addr and create a new entry. Note that it is devp->ds_owner which is likely to have the addressing info I am interested in.
    typedef struct {
        rfs4_dbe_t *dbe;
        time_t last_access;
        char *identity;
        ds_id ds_id;
        ds_verifier verifier;
        uint32_t dsi_flags;
        list_t ds_addr_list;
        list_t ds_guid_list;
    } ds_owner_t;
So we have lists of ds_addr and ds_guid. But that ds_guid_list is currently only created and never populated.
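To keep the search-then-create flow of mds_ds_addr_update straight, here is a toy Python 3 model of it. It is only a sketch: a dict keyed by universal address stands in for the ds_addr_uaddr_idx lookup, the owner is a dict with a ds_addr_list, and I am assuming MDS_SET_DS_FLAGS amounts to recording the validuse value, which I have not verified.

```python
def ds_addr_update(index_by_uaddr, owner, netid, uaddr, validuse):
    """Find the address entry by uaddr, or create it and hang it off the owner.

    Returns (entry, created).
    """
    dev = index_by_uaddr.get(uaddr)
    if dev is not None:
        # Found an existing entry: only the flags are refreshed.
        dev["flags"] = validuse
        return dev, False

    # Not found: build a new entry from the reported address.
    dev = {"netid": netid, "addr": uaddr, "flags": validuse, "owner": owner}
    index_by_uaddr[uaddr] = dev
    owner["ds_addr_list"].append(dev)
    return dev, True
```

In this model, as in the quoted code, only ds_addr_list ever gets an entry appended; nothing adds to a guid list.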
Time to digress and attack this from a different angle.
Looking at the NFSv4.1 pNFS Devices and File Layout Structures
This may no longer be accurate, but Robert Gordon, before he passed on (to another company), left us with this image (from Server Design Document): http://opensolaris.org/os/project/nfsv41/documentation/nfsv41_server/d13_layout_devices.jpg
This says quite clearly that while it may be the spe's job to generate layouts, in order to do so you need to construct a device list. Up until now, I've been working on a months-old statement that I need to "just generate the stripe width, stripe unit size, and an array of guids". Implicit in that is that someone else would do the logic, because it was trivial, to morph that into a layout.
And you know, I keep on looking for an explicit mapping to occur between the selection of the layout and the device list - it is the title of this series of blog articles. It may not be occurring because of the maturity of the code. I.e., everything up to now is predicated on there being a fixed number of DSes and a fixed number of data stores. And relationships just work in that if you only have 1 entry in a list because there is only 1 data store, then all of the other associated lists will also only have 1 entry.
There is still a lot of work to do to make this implementation a product.
Anyway, the picture spells out a lot of what is in the spec. The other way to attack this would be to look at a snoop trace during a create.
But any way you slice it, there is no magic happening to tie a guid to a device list.
I'm going to have to expand the scope of my project.
Understanding layout creation to understand what spe will have to do
In my task list for spe, a large item has been how to tie it into the current code base - you might have seen me reference it as translating data path to guid. To do that, I've had to understand what the current code is doing and the limitations in that code. I've also had to question exactly what it is we want done.
Quick overview of spe
The Simple Policy Engine (spe) tells the pNFS MetaData Server (MDS) how to lay out the stripes on the data servers (DS) at file creation time. If you think of RAID, a file is striped across disks and we need to know how many disks it is striped across and what the width of the stripe is. Then to determine which disk a particular piece of data is on, we can divide the file offset by the stripe width and take that modulo the number of disks to get the disk.
This is simplistic, but is also the basic concept behind layout creation in pNFS. A huge difference is that we need to tell the client not only the stripe count and width, but the machine addresses of the DSes. It is a little bit more complex than that as each DS might have several data stores associated with it, a data store might be moved to a different DS, etc. We capture that complexity in the globally unique id (guid) assigned to each data store. But conceptually, let's consider just the base case of each DS having only one data store and it is always on that DS.
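To make the base case concrete, here is a minimal sketch of the offset-to-DS arithmetic, assuming a plain round-robin stripe. The helper name and its parameters are made up for illustration and do not come from the OpenSolaris source:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the OpenSolaris source): map a file
 * offset to the index of the data server that holds it, assuming the
 * stripe units are dealt out to the data servers in simple rotation.
 */
static uint32_t
stripe_index_for_offset(uint64_t offset, uint32_t stripe_unit,
    uint32_t stripe_count)
{
	assert(stripe_unit > 0 && stripe_count > 0);

	/* Which stripe unit, counting from 0, does the offset fall in? */
	uint64_t unit_no = offset / stripe_unit;

	/* Units wrap around the data servers. */
	return ((uint32_t)(unit_no % stripe_count));
}
```

With a 32K stripe unit and 4 data servers, offset 0 lands on index 0, offset 32768 on index 1, and offset 131072 wraps back around to index 0.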
Overview of Current Layout Generation
So the NFSv4.1 protocol defines an OPEN operation and a LAYOUTGET operation. It doesn't define how an implementation will determine which data sets are put into the layout.
In the current OpenSolaris implementation, these two operations result in the following call chains:
    "OPEN" -> mds_op_open -> mds_do_opennull -> mds_create_file
    "LAYOUTGET" -> mds_op_layout_get -> mds_fetch_layout -> mds_get_flo -> fake_spe
In my development gate, a call to spe_allocate currently occurs in mds_create_file.
The relevant files to look at are: usr/src/uts/common/fs/nfs/nfs41_srv.c and usr/src/uts/common/fs/nfs/nfs41_state.c.
Note: I will be quoting routines in the above two files. Over time, those files will change and will not match what I quote.
mds_fetch_layout
The interesting stuff in layout creation occurs in mds_fetch_layout:
Note that we are starting in nfs41_srv.c.
    8320  if (mds_get_flo(cs, &lp) != NFS4_OK)
    8321          return (NFS4ERR_LAYOUTUNAVAILABLE);
mds_get_flo
And in mds_get_flo:
    8269  mutex_enter(&cs->vp->v_lock);
    8270  fp = (rfs4_file_t *)vsd_get(cs->vp, cs->instp->vkey);
    8271  mutex_exit(&cs->vp->v_lock);
    8272
    8273  /* Odd.. no rfs4_file_t for the vnode.. */
    8274  if (fp == NULL)
    8275          return (NFS4ERR_LAYOUTUNAVAILABLE);
Which basically states that the file must have been created and be in memory. This is not a panic for at least the following reasons:
- Client may have sent the LAYOUTGET before the OPEN. A crappy thing to do, but not a reason for a panic.
- The server may have rebooted since the client sent the OPEN. Even if the file is on disk on the MDS, it is not incore. Clue the client in that they may need to reissue the OPEN.
Note that an odl is an on-disk layout. And the statement on 8278 is how I will tie the spe in with this code. During an OPEN, I can simply set fp->flp and bypass this logic. If there is any error, then this field will be NULL and we can grab a simple default layout here. So I'll probably rename fake_spe to be mds_generate_default_flo.
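As a rough sketch of that plan, and not the actual gate code: the types below are hypothetical stand-ins for rfs4_file_t and mds_layout_t, and the only thing borrowed from the source is the idea that a non-NULL flp, set during OPEN, wins over the default layout.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for mds_layout_t. */
typedef struct sketch_layout {
	int	loid;
} sketch_layout_t;

/* Hypothetical stand-in for rfs4_file_t; only the flp field matters. */
typedef struct sketch_file {
	sketch_layout_t	*flp;	/* set during OPEN when the spe succeeds */
} sketch_file_t;

/*
 * Prefer the layout the policy engine attached at OPEN time and fall
 * back to the default layout when that field is still NULL.
 */
static sketch_layout_t *
sketch_pick_layout(const sketch_file_t *fp, sketch_layout_t *default_lo)
{
	/* The fp == NULL case is rejected earlier, as in the quoted check. */
	assert(fp != NULL);

	if (fp->flp != NULL)
		return (fp->flp);
	return (default_lo);
}
```

If the default layout itself could not be generated, the sketch returns NULL and the caller would still have to report that the layout is unavailable.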
fake_spe
So understanding what fake_spe does will help me understand what the real spe will have to do:
    8236  int key = 1;
    ...
    8241  *flp = NULL;
    8242
    8243  rw_enter(&instp->mds_layout_lock, RW_READER);
    8244  lp = (mds_layout_t *)rfs4_dbsearch(instp->mds_layout_idx,
    8245      (void *)(uintptr_t)key, &create, NULL, RFS4_DBS_VALID);
    8246  rw_exit(&instp->mds_layout_lock);
    8247
    8248  if (lp == NULL)
    8249          lp = mds_gen_default_layout(instp, mds_max_lo_devs);
    8250
    8251  if (lp != NULL)
    8252          *flp = lp;
The current code only ever has 1 layout in memory. Hence, the key is 1. We'll need to see how that layout is generated. And that occurs in mds_gen_default_layout. Note how simplistic this code is - if for any reason the layout is deleted from the table, it is simply added back in here. Right now, the only reason the layout would be deleted is if a DS reboots (look at ds_exchange in ds_srv.c).
mds_gen_default_layout
This is the code that builds up the layout and stuffs it in memory:
Note that we have switched into nfs41_state.c.
    1046  int mds_default_stripe = 32;
    1047  int mds_max_lo_devs = 20;
    ...
    1052  struct mds_gather_args args;
    1053  mds_layout_t *lop;
    1054
    1055  bzero(&args, sizeof (args));
    1056
    1057  args.max_devs_needed = MIN(max_devs_needed,
    1058      MIN(mds_max_lo_devs, 99));
    1059
    1060  rw_enter(&instp->ds_addr_lock, RW_READER);
    1061  rfs4_dbe_walk(instp->ds_addr_tab, mds_gather_devs, &args);
    1062  rw_exit(&instp->ds_addr_lock);
    1063
    1064  /*
    1065   * if we didn't find any devices then we do no service
    1066   */
    1067  if (args.dex == 0)
    1068          return (NULL);
    1069
    1070  args.lo_arg.loid = 1;
    1071  args.lo_arg.lo_stripe_unit = mds_default_stripe * 1024;
    1072
    1073  rw_enter(&instp->mds_layout_lock, RW_WRITER);
    1074  lop = (mds_layout_t *)rfs4_dbcreate(instp->mds_layout_idx,
    1075      (void *)&args);
    1076  rw_exit(&instp->mds_layout_lock);
We first walk across the instp->ds_addr_tab and look for effectively 20 entries. Note that max_devs_needed is always 20 for this code and so will be args.max_devs_needed.
I think the check on 1067 is incorrect and a result of the current implementation normally being on a community with 1 DS. It should be the case that args.dex is greater than or equal to max_devs_needed. Actually, we need to be passing in how many devices we will have D (the ones assigned to a policy) and how many we need to use S, with S <= D. The args.dex will have to be >= S.
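A minimal sketch of the stricter check I have in mind. The function and parameter names are hypothetical; `found` plays the role of args.dex after the walk, `needed_s` is S and `assigned_d` is D:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical replacement for the "found nothing" test: the walk has
 * succeeded only if it found at least the S devices the stripe needs,
 * where S can never exceed the D devices assigned to the policy.
 */
static bool
sketch_enough_devices(int found, int needed_s, int assigned_d)
{
	assert(needed_s > 0 && needed_s <= assigned_d);

	return (found >= needed_s);
}
```

Under this check, a policy that wants 4 of its 6 devices would refuse to build a layout when only 1 device has been found, where the current test would happily continue.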
Note that on 1070, we assign it the only layout id which will ever be generated. And if we play things right, we could store this layout id back in the policy and avoid regenerating the layout if at all possible.
Finally we stuff the newly created layout into the table.
mds_gather_devs
So mds_gather_devs does the work of stuffing the layout. It gets called for every entry found in the instp->ds_addr_tab:
    974  if (gap->dex < gap->max_devs_needed) {
    975          gap->lo_devs[gap->dex] = rfs4_dbe_getid(dp->dbe);
    976          gap->dev_ptr[gap->dex] = dp;
    977          gap->dex++;
    978  }
So we keep on reading ds_addr_t data structures until we have enough.
Now, how is that table populated? You can look for ds_addr_idx over in usr/src/uts/common/fs/nfs/ds_srv.c, but basically, for each data store that a DS registers, one of these is created.
The upshot of all this is that if a pNFS community has N data stores, then the layout generated for the current implementation will have a stripe count of N.
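The same gathering logic can be tried out in user space. This is only a sketch: the struct and function names are invented, a plain array stands in for the table walk, and the cap of 20 mirrors mds_max_lo_devs from the quoted code.

```c
#include <assert.h>
#include <stddef.h>

#define	SKETCH_MAX_DEVS	20	/* mirrors mds_max_lo_devs */

/* Hypothetical, trimmed-down stand-in for struct mds_gather_args. */
typedef struct sketch_gather {
	int	max_devs_needed;
	int	dex;			/* number gathered so far */
	int	lo_devs[SKETCH_MAX_DEVS];
} sketch_gather_t;

/* Called once per table entry; entries past the cap are ignored. */
static void
sketch_gather_dev(int dev_id, sketch_gather_t *gap)
{
	assert(gap != NULL && gap->max_devs_needed <= SKETCH_MAX_DEVS);

	if (gap->dex < gap->max_devs_needed) {
		gap->lo_devs[gap->dex] = dev_id;
		gap->dex++;
	}
}

/* Walk an array standing in for the table and report the stripe count. */
static int
sketch_stripe_count(const int *table, size_t n, int max_devs_needed)
{
	sketch_gather_t args = { 0 };

	args.max_devs_needed = max_devs_needed;
	for (size_t i = 0; i < n; i++)
		sketch_gather_dev(table[i], &args);
	return (args.dex);
}
```

Three entries in the table give a stripe count of 3, while 25 entries are capped at 20.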
Back to mds_fetch_layout
Note that we are also back in nfs41_srv.c.
Okay, we've generated the layout and start to generate the otw (over the wire) layout:
    8332
    8333  mds_set_deviceid(lp->dev_id, &otw_flo.nfl_deviceid);
    8334
Crap, it is sending the device id across the wire! I'm going to have to rethink my approach. Instead of storing a policy as a device list and picking which devices I want out of that list (i.e., a Round Robin (RR) scheduler), I'm going to have to store each generated set as a new device list.
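To show what "store each generated set as a new device list" might look like, here is a sketch under my own assumptions. Nothing here is in the gate: the struct, the function, and the idea of a caller-supplied list id are all hypothetical.

```c
#include <assert.h>

#define	SKETCH_MAX_DEVS	20

/* Hypothetical: one stored device list per generated set. */
typedef struct sketch_devlist {
	int	list_id;		/* what would go over the wire */
	int	count;
	int	devs[SKETCH_MAX_DEVS];
} sketch_devlist_t;

/*
 * Pick `need` devices out of the policy's `have` devices, starting at
 * `cursor` and wrapping around, and record the pick as its own list.
 */
static void
sketch_rr_devlist(const int *policy_devs, int have, int need, int cursor,
    int list_id, sketch_devlist_t *out)
{
	assert(have > 0 && need > 0);
	assert(need <= have && need <= SKETCH_MAX_DEVS);
	assert(cursor >= 0);

	out->list_id = list_id;
	out->count = need;
	for (int i = 0; i < need; i++)
		out->devs[i] = policy_devs[(cursor + i) % have];
}
```

The round robin pick still happens, but its result is frozen into a list with its own id, so the id handed to the client always names exactly the devices in that layout.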
I don't understand the process like I thought I did.
Going back to mds_gather_devs, it is not stuffing data stores into a table as I thought. Instead, it is stuffing DS network addresses into a table.
Missing link
What I'm missing is how the ds_addr entries map back to data stores. Okay, this code in mds_gen_default_layout does it:
    1073  rw_enter(&instp->mds_layout_lock, RW_WRITER);
    1074  lop = (mds_layout_t *)rfs4_dbcreate(instp->mds_layout_idx,
    1075      (void *)&args);
    1076  rw_exit(&instp->mds_layout_lock);
We have just gotten the device list via the walk over mds_gather_devs. And now we effectively call mds_layout_create on 1074.
    1104  ds_addr_t *dp;
    1105  struct mds_gather_args *gap = (struct mds_gather_args *)arg;
    1106  struct mds_addlo_args *alop = &gap->lo_arg;
    ...
    1119  lp->layout_type = LAYOUT4_NFSV4_1_FILES;
    1120  lp->stripe_unit = alop->lo_stripe_unit;
    1121
    1122  for (i = 0; alop->lo_devs[i] && i < ...; i++) {
    1123          lp->devs[i] = alop->lo_devs[i];
    1124          dp = mds_find_ds_addr(instp, alop->lo_devs[i]);
    1125          /* lets hope this doesn't occur */
    1126          if (dp == NULL)
    1127                  return (FALSE);
    1128          gap->dev_ptr[i] = dp;
    1129  }
Okay, alop->lo_devs is the array we built in mds_gather_devs. Yes, yes, that is true.
I just figured out where all of my confusion is coming from - the code has struct ds_addr and ds_addr_t. In the xdr code, struct ds_addr is just an address (usr/src/head/rpcsvc/ds_prot.x):
    /*
     * ds_addr -
     *
     * A structure that is used to specify an address and
     * its usage.
     *
     * addr:
     *
     * The specific address on the DS.
     *
     * validuse:
     *
     * Bitmap associating the netaddr defined in "addr"
     * to the protocols that are valid for that interface.
     */
    struct ds_addr {
            struct netaddr4 addr;
            ds_addruse validuse;
    };
But in the code I've been looking at, ds_addr_t is a different structure (see usr/src/uts/common/nfs/mds_state.h):
    /*
     * ds_addr:
     *
     * This list is updated via the control-protocol
     * message DS_REPORTAVAIL.
     *
     * FOR NOW: We scan this list to automatically build the default
     * layout and the multipath device struct (mds_mpd)
     */
    typedef struct {
            rfs4_dbe_t *dbe;
            netaddr4 dev_addr;
            struct knetconfig *dev_knc;
            struct netbuf *dev_nb;
            uint_t dev_flags;
            ds_owner_t *ds_owner;
            list_node_t ds_addr_next;
    } ds_addr_t;
This is pure evil because we typically equate foo_t as being typedef struct foo foo_t. As you can see, I've been fighting that in the above analysis.
I'm going to file an issue on this naming convention and leave the analysis here. I'll come back to it and rewrite it as if I knew all along that I was using a ds_addr_t and not a struct ds_addr.
nfs41-gate is branch merged with snv_100
I just merged the nfs41-gate with the snv_100 tagged onnv-gate. This caused me to bump the closedv tag to 2 in the nfs41-gate.
You can refresh your copies of our closed-bins at http://www.opensolaris.org/os/project/nfsv41/downloads/.
BTW: While the pushes are automatic, I'm still trying to get the notification to be automatic.
