Wednesday, July 05, 2006

It's the end of WinFS as we know it, and I feel fine!

So Microsoft went and killed WinFS again, a few days back. I really can't say I'm surprised. Databases and filesystems are conceptually similar, and the idea of unifying the two is certainly attractive. From a purely theoretical standpoint, anyway. As with most attractive ideas, someone else thought of it years before it showed up in Redmond. In early versions of BeOS, the "filesystem" was a database, but that turned out to be way too slow, so they dropped the idea. A general-purpose database is never going to be as fast as a traditional filesystem, and maybe that's what happened to WinFS. We may never know the full story, and I'm not sure it matters anyway.

Not only am I not surprised, I'm pleased as punch. In my RL endeavors over the years, I've spent an inordinate amount of time working around all the odd quirks and foibles of Windows filesystems. Ok, quirks and foibles is being very kind. Maybe it's better to call it a disgusting, oozing mass of scabs and scar tissue. How's that for vivid? Every time a new version of Windows comes out, the complexity multiplies, so for the sake of readable, maintainable code, and for the sake of my own professional sanity, I have to firmly oppose anything that would add more special cases.

If you're writing a Windows app and want it to be really robust, especially if it needs to read and write to random user-specified parts of the filesystem, here are a few of the fun special cases you'll have to deal with, just off the top of my head:

  • I've already mentioned the new fun that crops up on x64 Windows and its goofy filesystem redirection misfeatures, so I won't repeat the rant here.
  • Long paths can be a problem. Windows defines the constant MAX_PATH as 260 characters, and that's often enough in the real world. But this constant is sort of misleading, in that each individual component of the full path can be nearly that long on its own (up to 255 characters on NTFS). It's fairly easy to confirm this yourself with a little effort, and it's not hard to create a few nested directories with long names, such that the fully qualified path is greater than MAX_PATH in length. But if you try passing a long path like this to Win32 file functions (CreateFile, for instance), you're in for a rude shock. Sure, it's a valid path, but it's too long, so Windows errors out. Oh, but this is Windows, and there's always an obscure workaround if you just hunt around in MSDN long enough. And sure enough, if you preface your jumbo-sized path with the magic incantation \\?\, you can use paths up to the real maximum of roughly 32,767 characters (just shy of 2^15). Oh, and you can only do this with the Unicode Win32 functions, btw. The ANSI ones just aren't hip to the jive.

    Either you do this magic trick, or you break the path down into manageable chunks, and SetCurrentDirectory a few times using relative paths until you can finally access the file or directory you're interested in. But that's even more ugly, IMHO.

    In all fairness, this can happen in the Unix world too, and there's no magic incantation to set things right. All you can do is use the "more ugly" option and chdir as needed. Blech.
  • If you've got a \\server\share UNC, you can treat it the same way you'd treat a drive letter, basically. But if you want to list all shares for a given machine, you have to use a couple of completely different functions and pull in a whole separate dll. There are another couple of functions to enumerate machines in an NT domain, in case you need to do that. And when you enumerate the shares on a given machine, you may end up with a few additional things you weren't expecting, like the hidden admin and IPC$ shares, plus any shared printers you might have. So you'll need to either handle these cases intelligently, or be sure to pass the right flags so you don't see 'em.
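Here's a rough sketch of the "handle these cases intelligently" option, for what it's worth. It assumes you've already pulled (name, type) pairs out of whatever share-enumeration call you're using (NetShareEnum in netapi32.dll is the usual suspect). The STYPE_* values are meant to match lmshare.h, so check them against your SDK headers; user_visible_disk_shares is a made-up helper name and the sample list is invented for illustration.

```python
# Share type values as I understand them from lmshare.h (an assumption, verify locally).
STYPE_DISKTREE = 0x00000000   # ordinary disk share
STYPE_PRINTQ = 0x00000001     # shared printer
STYPE_IPC = 0x00000003        # interprocess communication share (IPC$)
STYPE_SPECIAL = 0x80000000    # administrative shares such as C$, ADMIN$, IPC$
STYPE_MASK = 0x000000FF       # low byte holds the base type


def user_visible_disk_shares(shares):
    """Keep only plain disk shares from an iterable of (name, type) pairs."""
    keep = []
    for name, share_type in shares:
        if share_type & STYPE_SPECIAL:
            continue  # hidden admin share
        if (share_type & STYPE_MASK) != STYPE_DISKTREE:
            continue  # printer, device, or IPC share
        keep.append(name)
    return keep


# Hypothetical enumeration result, not captured from a real machine.
sample = [
    ("Public", STYPE_DISKTREE),
    ("C$", STYPE_DISKTREE | STYPE_SPECIAL),
    ("IPC$", STYPE_IPC | STYPE_SPECIAL),
    ("LaserJet", STYPE_PRINTQ),
]
print(user_visible_disk_shares(sample))
```

Filtering on the type bits rather than on a trailing "$" in the name is a judgment call; either could miss something on a sufficiently odd server.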
  • If you need to do the 32k character trick with a UNC, the holy mantra is \\?\UNC\server\share\path. I think. It's been ages since I needed to do this, and I'm not on a Windows box to test it at the moment.
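The two long-path incantations (plain and UNC) boil down to a bit of string surgery, sketched below. to_extended_path is a hypothetical helper, not anything from the Win32 API, and it assumes the caller hands it an absolute path that already uses backslashes, since Windows skips its usual path cleanup ("." and ".." handling, forward-slash conversion) once the prefix is in place. The result is only good for the Unicode flavor of the file functions.

```python
EXTENDED_PREFIX = "\\\\?\\"            # spells out \\?\ followed by the path
EXTENDED_UNC_PREFIX = "\\\\?\\UNC\\"   # spells out \\?\UNC\ followed by server\share
DEVICE_PREFIX = "\\\\.\\"              # spells out \\.\ followed by a device name


def to_extended_path(path):
    """Add the long-path prefix to an absolute, backslash-separated path."""
    if path.startswith((EXTENDED_PREFIX, DEVICE_PREFIX)):
        return path  # already prefixed, leave it alone
    if path.startswith("\\\\"):
        # UNC form: drop the leading double backslash, keep server\share\...
        return EXTENDED_UNC_PREFIX + path[2:]
    # Drive-letter form: just glue the prefix on the front
    return EXTENDED_PREFIX + path


print(to_extended_path("C:\\some\\very\\deep\\tree\\file.txt"))
print(to_extended_path("\\\\server\\share\\path"))
```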
  • Even in the most current versions of Windows, the reserved names of creaky old DOS devices are still "special". Try creating a file on disk called LPT1, or even LPT1.TXT. Didn't work, did it? Turns out that if you try to use a special device name, even with a file extension, even with a full path specified, Windows assumes you really want to talk to the device. As always, there's probably some unavoidable backward-compatibility reason behind this. And as always, there's a workaround. Preface your full path with the eldritch runes \\.\, and you can proceed as you like. Those runes tell Windows you're providing a literal device name (which just so happens to specify a disk device and a hierarchical name underneath it), in which case Windows can finally get it through its thick skull that you aren't interested in, say, line printer #1. Of course, if you use this method to create a file with a special name, and then try to access it with the tools of mere mortals (say, Notepad), fun happens. Well, mild fun. Usually your app just locks up.
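If you'd rather spot these names before Windows does, a check along these lines is one way to go about it. It's a sketch: is_reserved_device_name is a hypothetical helper, and the name list (CON, PRN, AUX, NUL, COM1 through COM9, LPT1 through LPT9) is the commonly documented set as far as I know, so treat it as an assumption rather than gospel.

```python
RESERVED_DEVICE_NAMES = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{n}" for n in range(1, 10)}
    | {f"LPT{n}" for n in range(1, 10)}
)


def is_reserved_device_name(path):
    """True if the last path component would be taken for a DOS device."""
    # Only the final component matters, with or without a directory in front.
    base = path.replace("/", "\\").rsplit("\\", 1)[-1]
    # The extension is ignored, so LPT1.TXT is just as "special" as LPT1.
    stem = base.split(".", 1)[0].rstrip(" ")
    return stem.upper() in RESERVED_DEVICE_NAMES


for candidate in ("LPT1", "C:\\temp\\lpt1.txt", "LPT10.txt", "console.log"):
    print(candidate, is_reserved_device_name(candidate))
```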
  • NTFS is case-insensitive but case-preserving. Except when you tell it to be case-sensitive. If you use the flag FILE_FLAG_POSIX_SEMANTICS in calls to CreateFile, Windows uses Unix-style naming rules, and you can create multiple files in the same directory whose names differ only by case. This, uh, feature has been in Windows since the beginning, in order to support that creaky old Posix subsystem that nobody ever used. This wouldn't be that big of a deal except that most apps don't expect to see this situation and get very confused when they do, including Explorer. In WinXP and later, the Posix naming feature is disabled by default, but you can reenable it simply by tweaking a registry setting whose name I can't recall at the moment.
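If your app walks directories that might have been touched this way, it can at least notice the situation instead of getting confused. The sketch below groups names that differ only by case. case_collisions is a hypothetical helper, and it uses Python's str.upper() as a stand-in for whatever case-folding table NTFS really uses, so it's an approximation at best.

```python
from collections import defaultdict


def case_collisions(names):
    """Return groups of names that are identical once case is ignored."""
    groups = defaultdict(list)
    for name in names:
        # Approximation: NTFS has its own upcase table, this uses Python's.
        groups[name.upper()].append(name)
    return [group for group in groups.values() if len(group) > 1]


print(case_collisions(["Makefile", "makefile", "README"]))
```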
  • There's another class of special Windows filenames to consider. Every NTFS volume contains a few hidden files with reserved names ($MFT, $MFTMIRR, etc.), which aren't part of any directory and which require a little special handling, too. Any function that expects the specified object to have a parent directory (e.g. FindFirstFile) is guaranteed to fail.
  • And who can forget Windows' alternate data streams? I won't go into a huge long rant about them here, since you can find lots of existing rants on the topic with your favorite search engine. I'll just say that ADSs wouldn't be a problem a.) if they were easily visible at a command prompt, and/or in Explorer, and b.) if there was a simple API for handling them. Neither of these things is true. As a programmer, your best bet is to monkey-see-monkey-do with the code in MSDN. Feel free to try to understand what's going on with all those BackupRead or BackupWrite calls, if you like. And remember, streams come in a number of distinct types. I've found that many ADS tools only look at streams of the "standard" ADS type, which may not be sufficient if you're facing a clever attacker, or a new MS Office feature (which is basically the same thing).

    One additional fun quirk of Windows alternate data streams is that you can attach them to directories, not just files. And just like with files, you can only get rid of a directory's ADS by deleting the directory. If you attach a stream to a root directory, which you can do, there's no possible way that I'm aware of to delete the damn thing.

    If a stream happens to have a name, and you know what the name is, you can open it in any application with the syntax c:\path\filename:stream. If you've ever wondered why you can't use colon characters in regular filenames, here's why. The colon is a reserved 'separator' character that lets Windows know you're working with a named stream. You can treat named streams just like files: you can open or create 'em with Notepad, redirect data into them at a command prompt, and so forth. Everything works like a regular file except that they don't appear in a directory listing, and they disappear if you delete the main file.

    Unless MS drops it between now and whenever Vista slithers out into the light of day, it looks like there'll finally be a stream API at long last. But if you need to support Windows versions before Vista, this will just add to your code's complexity, not reduce it.
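For the c:\path\filename:stream syntax described above, here's a small sketch of pulling the stream name apart from the file name without tripping over the drive-letter colon. split_stream is a hypothetical helper; it assumes backslash separators, treats a leading letter-plus-colon as a drive, and leaves any trailing type suffix (the ":$DATA" sort of thing) attached to the stream name.

```python
def split_stream(path):
    """Split 'file:stream' into (file, stream); stream is None if absent."""
    drive, rest = "", path
    # A leading "X:" is a drive letter, not a stream separator.
    if len(path) >= 2 and path[1] == ":" and path[0].isalpha():
        drive, rest = path[:2], path[2:]
    # Only the last path component can carry a stream name.
    head, sep, tail = rest.rpartition("\\")
    name, colon, stream = tail.partition(":")
    return drive + head + sep + name, (stream if colon else None)


print(split_stream("c:\\path\\filename:stream"))
print(split_stream("c:\\path\\filename"))
```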
  • There are several different ways of getting attributes on a file: FindFirstFile, GetFileAttributesEx, and GetFileInformationByHandle, and there may be others, in the future if not now. Each returns slightly different information, and each fails under slightly different circumstances. And what's worse, the file times returned by FindFirstFile may not always match those returned by GetFileTime. This seems to be completely undocumented, but from what I've been able to gather, FindFirstFile always fetches file attribute values from disk, while GetFileTime hits cached values in memory whenever it can. When you do something that updates the last access time on a file, the change typically isn't flushed to disk immediately, and several access time updates can happen before the value on disk changes. I've seen the on-disk value be as much as an hour out of sync with the in-memory version. There's probably an obscure registry setting somewhere to fine-tune this behavior to your heart's delight, if you've got nothing better to do.
  • Windows and Unix both provide the ability to lock byte ranges within files, and prevent other apps from reading or writing the specified range while the lock is held. If it's important to you to ensure that you read/write all of the file, or none of it, you'll want to scrutinize GetLastError() or errno if an operation fails or reads or writes fewer than the requested number of bytes. In my newbie days I used to think that if you could open a file, you could read or write as you liked inside the file, and that isn't always true. File locking isn't that common, and less common than it ought to be IMHO, but you may run into it at some point, and you'll see weird results if you don't handle it correctly.

    One more step to be aware of on Unix: If your app's going to run unattended, you might want to be sure you're opening files with O_NONBLOCK, so that when you run into a file lock, your read attempt will fail instead of blocking for an open-ended amount of time.
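The "all of the file, or none of it" idea can be sketched as a read loop that refuses to hand back a partial result. This is the Unix flavor only; read_exact is a hypothetical helper built on os.read, and it simply lets any OSError (the errno side of things, which is where a lock conflict on a non-blocking descriptor would show up) propagate to the caller.

```python
import os


def read_exact(fd, count):
    """Read exactly `count` bytes from `fd`, or raise instead of returning less."""
    chunks = []
    remaining = count
    while remaining > 0:
        # os.read may return fewer bytes than asked for; it raises OSError on failure.
        chunk = os.read(fd, remaining)
        if not chunk:
            # End of data before we got everything: treat it as a hard error.
            raise EOFError(f"wanted {count} bytes, got {count - remaining}")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)
```

A pipe is enough to try it out: write five bytes, close the write end, and read_exact(fd, 5) returns them, while asking for more raises EOFError rather than quietly coming up short.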
  • You can confuse Windows with other sorts of illegal filenames. For whatever reason, filenames ending in spaces or periods are Bad, and you normally can't create them, but you can if you've got a Unix box w/ Samba. Windows hates it when you do this, and your only option is to try the alternate 8.3 name instead, if the file's got one. Windows also really hates it when filenames contain characters less than 0x20. This is really hard to do; I pulled it off once by hex-editing the directory listing on a scratch floppy. Windows refused to open any of the files I'd messed around with. But this is sort of outside the scope of this post, since there's no workaround, and the odds you'll run across this in the wild are vanishingly small.
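There may be no workaround, but you can at least flag these names when you trip over them. A minimal sketch, covering only the two cases mentioned above (trailing space or period, and characters below 0x20); filename_problems is a hypothetical helper meant for ordinary single names, not for the "." and ".." directory entries.

```python
def filename_problems(name):
    """Return the reasons Win32 is likely to choke on a single filename."""
    problems = []
    if name.endswith((" ", ".")):
        problems.append("ends with a space or period")
    if any(ord(ch) < 0x20 for ch in name):
        problems.append("contains a character below 0x20")
    return problems


for candidate in ("report.txt", "report.", "bad\x01name "):
    print(repr(candidate), filename_problems(candidate))
```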
  • There are three different kinds of "links" Windows knows about. POSIX-style hard links, reparse points (a.k.a. "junctions"), and shortcuts. The first two are poorly supported and much of the OS isn't aware of them, while shortcuts are implemented all the way up at the Windows shell level, and working with them involves COM and all kinds of silly needless overhead. Command line apps can't do diddly with shortcuts. Yeah, sure, make the UI layer responsible for basic filesystem features. Great plan there, guys.
  • That's not the only thing implemented at the shell level. There's a whole separate shell namespace, rooted at the current user's desktop, with a whole new set of terminology to learn: LPITEMIDLISTs, monikers, antimonikers, and so forth. Any filename can be a moniker, URLs are monikers, all sorts of exciting things are monikers. One fun thing MS did was to come up with something they call "structured storage", in which subelements of a given document can have globally unique names. The classic example is Excel, where you can specify a range within the file c:\foo.xls with the moniker c:\foo.xls!a1:d10, if you're using an app that expects monikers, not literal paths. "Structured storage" uses the exclamation mark as a separator, and this is not to be confused with the colon used with alternate data streams. They're completely separate animals. Perhaps you could put an Excel spreadsheet inside an ADS and then use structured storage notation to address a range inside it. I've never tried that. Your mileage may vary.

    The problem is that this seems to be yet another M$ orphan technology, supported for Excel documents and CHM (compressed html) files, but nowhere else that I've ever seen. If there's a syntax for addressing ranges within Word documents, I've never encountered it. It's also worth noting that when you use this trick on an Excel spreadsheet, you're essentially loading and running Excel inside your application. When you pass it a "mailto:" url, you're launching Outlook in the context of your app. Is that a wise idea? I really couldn't say. I report, you decide.
  • That's not the only competing "global" namespace Windows offers. Within the NT kernel, the Object Manager namespace ties all sorts of kernel objects together. For instance, the file c:\foo.txt is really named \Device\Harddisk0\Partition1\foo.txt so far as the kernel's concerned, and the registry key HKEY_LOCAL_MACHINE\Software\Foo is really \Registry\Machine\Software\Foo. You generally don't see these names outside kernel space, which is a shame. Windows makes a bit more sense once you poke around the Object Manager tree a little. Fortunately there's a tool that lets you do this: WinObj, from the gurus over at Sysinternals, the same folks who discovered Sony's music cd spyware.

    Working with Object Manager names isn't hard, but you need to use the poorly documented (by M$) "Native API". As far as I'm concerned, if you want to mess around with this stuff, you need Gary Nebbett's book Windows NT/2000 Native API Reference. It can be a bit of a dry read, but the Bible and Larousse Gastronomique can be dry reads, too, and the Nebbett book is just as essential in its own intended field. The familiar Win32 API is often just a thin layer on top of the Native API, with a few minor differences in semantics, and some obscure Native API features that aren't mirrored in Win32 land.
  • Oh, but that's not all, far from it. If your app wants to care about anything that might have an ACL attached, the Native API will only get you so far. You can also have various kinds of "user objects" sitting around, which aren't kernel objects and so don't appear in the kernel object tree. Common examples include Windows services, both local and remote, and LSA and SAM objects, along with more obscure things like NetDDE shares. Oh, and did I mention that Active Directory objects can have their own ACLs too? Well, they can, in case it matters to you.
  • The aforementioned Nebbett book contains a short appendix that explains how to read any NTFS file by opening the disk device, parsing its MFT, and reading or writing to the contents of the file without ever opening it. This appendix is tacked on almost as an afterthought, but it's why I originally bought the book. Many moons ago, I needed to be able to read and back up a file even if it was currently opened for exclusive access by another process. Unlike most Windows features, exclusive access pretty much is exactly that. There's no obscure API flag, or registry setting, or process token privilege that lets you override the exclusive access thing, or if there is, nobody outside Redmond knows about it. Your only option is to go in at a very low level and read the bits off the disk device, so long as you're on an NTFS volume, and you really should check first before proceeding. This is obviously a pretty extreme measure, but hey, I'm all about extreme when I need to be.
  • And for the sake of completeness, there's also the weird object naming scheme that appears only in boot.ini and absolutely nowhere else. IIRC this has something to do with the firmware system on ancient MIPS-based NT boxes based on the "ACE" platform, from way back in the early 90's. Nothing ever really goes away on Windows. It's rare that anything even gets seriously deprecated. Usually MS just stops talking about it in public, and buries the reference materials in some dark corner off in MSDN. If they're really serious, they might not include the headers with the next version of Visual Studio, but that's about it. Whatever else you might say about Microsoft, they really are serious about backward compatibility. Especially the "backward" part.
