Sunday, October 30, 2022 at 6:01 PM

The decades long quagmire of encapsulated HTML

This is part of a series of posts I'm writing about creating an HTML Document format:

So my first thought when I started down the path of creating Hypertext HTML Document Editor was that I wanted an "all in one" file format which would include the CSS, images and other assets that a web page usually needs. I figured, "How hard can it be to throw together a zip file with an index page and a folder of files that an editor can read?" Silly me! It turns out that during the past 25 years or so, lots and lots of other developers and organizations have thought the same thing. Yet apparently it's one of those things which seems to never get any sort of traction.

After taking a look at what's been done already in order to not recreate the wheel, I came to realize a few things: First, I decided that nothing that's been done so far would have the simplicity nor widespread support that I was looking for. Secondly, trying to create Yet Another Bundled Web File Format was a really bad idea. But most importantly, since my goal was to replace Markdown as a simple, easily manageable rich text document format, trying to also tackle the bundling issue was besides the point. Markdown doesn't include files, doesn't care about them and will never have them. There's no reason for an HTML Document standard to deal with encapsulated files either, as essential as it might first seem.

All that said, I'm sure someone, somewhere might wonder why my Hypertext editor doesn't have some sort of option to save as a zipped file format of some sort, I thought I'd share what I discovered reading up on the topic. Then maybe someone can get the W3C, Mozilla, Google and Apple to finally make up their minds and decide on a standard that we can use in the future. (I won't hold my breath.

So, let's take a look at some of what's out there. But first, a look back in time.

Back to the future... .htmld

You won't be surprised to find out that way back in 1995, ideas for the web encapsulation problem had already been batted around, and some had actually been implemented. I found a message from someone named Dan Grillo at NeXT to a CalTech email list about the .htmld file format used on NeXT servers. (See the original here.):

From: Dan Grillo <Dan_Grillo@NeXT.COM>
Subject: What's in a .htmld or .htmd?
To: webstep@xent.caltech.edu, khare@CALTECH.EDU

> This one is a raging open question, but probably the easiest settled
> Generally, we want such a format to be self-contained, so the intention is that
> the wrapper can include other files, directories, symlinks; and that such
> structure should be preserved by cooperating applications (eText, for example,
> garbage collects). Here are a few of the design choices:
>
> .htmd (HyperText Markup Document) vs .htmld
> index.html vs. _____ (TXT.rtf, multiple topics, a la sanguish)
> multi-format representations (index-ascii, index-mono, index-color)

I'll continue with this one.

First, .htmld is already registered with NeXT's type registry;
I did it a long time ago. I don't think .htmd is.

What should be in a .htm[l]d? Right know I know of 3 forms.

1. foo.htmld/index.html
2. foo.htmld/TXT.html
3. foo.htmld/foo.html

I think forms 1 & 3 are useful.

Pages, StepWise, and NeXTanswers all currently serve files saved as .htmld
Right now NeXTanswers uses form #3, so the internal .html can be FTP'ed
or saved from a web browser and used on it's own. This is hard to
do if they all are index.html.

--Dan

Of course the thread goes on with a bunch of questions about the format, and as we know, it disappeared from existence. It just goes to show...

MIME: .mhtml

In 1999 the MIME encapsulated HTML format was standardized. Here's the Wikipedia description:

MHTML, an initialism of "MIME encapsulation of aggregate HTML documents", is a web page archive format used to combine, in a single computer file, the HTML code and its companion resources (such as images, audio and video files) that are represented by external hyperlinks in the web page's HTML code.

The content of an MHTML file is encoded using the same techniques that were first developed for HTML email messages, using the MIME content type multipart/related.

MHTML files use a .mhtml or .mht filename extension. [Also commonly .eml as well. -R]

This is still used for emails formatted in HTML and you can still export pages from Chromium browsers ("Save As -> Web Page, complete") where a page is saved literally as a fake email, complete with a "<Saved By Blink>" in the from: header. (That special from: tag actually turns on some restrictions in Chromium, so it's not just a random entry, without it the browser apparently goes into a less secure Internet Explorer mode.) I had never actually looked closely at that page before, so I was pretty surprised to see what a hack it is, and yet it's the way Chrome still works 20+ years after the format was defined, with little change.

In terms of using it as a basis for a cross platform document format, it's not supported by Firefox any more, nor Safari. Even though it's all nominally text, it's not a particularly user-friendly format, using as it does a mess of MIME-encoded HTML text with tons of embedded MIME-encoded data chunks. Though it is nice that the page has everything it needs, and is separated into sections (embedding a data:// url is much more disruptive in terms of content flow), it's not compressed in any way, and not something you'd want to store an .mp4 video in and send it to anyone.

Amusingly, on my Mac, the .mhtml file format is registered to Microsoft Word, not the default browser. If I had Outlook installed, it would probably claim first dibs. Opening the file in Word ignores all the embedded files - for security reasons I presume - making it a bit less than useful in that app. Opening an .mhtml file in Chrome also has a bunch of odd restrictions as well. And examining the page in DevTools doesn't really expose all the magic that's happening - like the process of decoding the file attachments nor the explicit security issues. It's very weird.

A Bugzilla entry from 23 years ago(!), showed some enthusiasm for using .mhtml in the then upcoming Mozilla Editor, though I'm not sure if that ever happened. Just shows again how long this topic has been floating out there. Apparently, XUL even had an <editor> tag.

Side note: MIME Types vs. File Extensions

Slightly off topic - but since I'm already talking about MIME. Linux is the only OS that actually pays attention to a text-file's MIME type, like "text/html", etc. Windows and macOS both depend on the file extension to decide what type of file it is. Thus, if you save an HTML5 file as .xhtml it will be treated as XML file - with the strict XML validation rules - regardless of the DOCTYPE or other meta tags in the header. If you name it as a .mhtml file, it will add extra security to the page, preventing it from opening images and other links from external sources. Using MIME-encoded HTML, though possible and even considered as a solution at one point to encapsulated HTML, really isn't really a good solution today.

Mozilla's MAFF

From Paolo Amadini's website :

MAFF files are standard ZIP files containing one or more web pages, images, or other downloadable content. Additional metadata, like the original page address, is saved along with the content. Unlike the related MHTML format, MAFF is compressed and particularly suited for large media files.

Support for reading and writing MAFF archives was provided in the Mozilla Application Suite, Firefox, and SeaMonkey thanks to the Mozilla Archive Format add-on from 2004 to 2018. While the original add-on is no longer maintained, the file format specification is still available and can be referenced by third-party software to provide better interoperability.

Seems pretty simple! Too bad it's dead.

Checking out the file format spec, you can immediately see it's a product of its 2000's era conception: The meta-data file which specifies which files are included in the zipped bundle has to be .rdf: An oddball XML spec which doesn't make any sense to anyone not really into the "semantic web", though it's not particularly horribly done in this instance. Paolo's site has a few examples, and it seems pretty simple. My best guess for why it died is browser security concerns about pulling in some random zip file into the browser.

A MAFF spec v2 could easily solve some of its shortcomings by using a JSON file as the meta, requiring a CSP header on the index page to prevent malicious scripts from running, and adding in a signature of some sort to guarantee that the file matches what the manifest says, similar to signed JAR files. But I guess that would be a totally different spec.

"Web Page, complete"

Nowadays, Firefox doesn't bother with a single-file format at all, not even MHTML. Saving a page as "webpage, complete" simply creates a folder with all the page's assets and it rewrites the URLs to point at it, just like Chromium does using the same save as option. (By the way, is it "Webpage" or "Web Page"? Chrome uses the first, Firefox uses the latter.) I find both browsers defaults to save the folder with the same name as the web page's title, complete with spaces, incredibly annoying, as it ends up having to put %20 everywhere in the URLs, or worse (Chrome) just including the spaces in the URLs, making what would be a relatively clean export into a complete mess. Also, rather than separating out the various media types - css, images, icons, etc., the browsers just dump them all into a single folder.

Apple's .webarchive

Of course, Apple has to be "different", so Safari doesn't have an option to download a web page and all the assets into a separate folder (in other words, "complete"), instead it only has the option to save in .webarchive format, which (according to Wikipedia, again) is, " a concatenation of source files with filenames saved in the binary "p-l ist" format using NSKeyedArchiver". If you ever want to see what's inside, you can convert the binary p-list into an text plist by using the command plutil -convert xml1 foo.webarchive -o foo.plist and you'll see that Apple simply just saved all the page's assets in a bog-standard Mac XML property file.

This format is supported by macOS Safari as well as iOS, though I can't imagine many people actually know about it, let alone use it much. There doesn't seem to be any projects out there that open or create a .webarchive using JavaScript, though there are some written in Python, Go and Ruby. I can't imagine it would be too hard to figure out, but then only Safari has support for reading the format natively, which isn't the idea.

Apparently, in addition to being Apple-only, .webarchive files aren't particularly safe to share. The way it stores scripts allows it execute in the context of the opener. It's basically as simple as MAFF above, but done using Macs/iOS in mind.

While I'm here, I'll express my long held belief that whoever came up with Apple's property list format didn't really understand how XML and/or SGML-like tags actually worked. Does it make sense to you that p-lists have stuff like <key>WebResourceData</key> instead of simply just <WebResourceData> ? It's like they were confused.

The eBook File Format - EPUB

So the ePub format seems like it might be an ideal web document container, but it's got quite a few limitations, the biggest of which is that it requires XHTML, rather than HTML5. It's also focused quite specifically on eBooks, not on documents, which are sort of the same, given that the former is made up of the latter.

Using Apple's Pages word processing app, you can export your doc as an ePub, which is pretty cool. It's just a zip file, so you can open it up and see what sort of decisions they made to create it, and how they include CSS, images, etc. The problem, of course, is that it's a one-way export. One would think that since the ePub nominally contains all the parts that make up the document - and the possibility to include as much metadata that may be missing - that they might create some sort of "editable ePub", but sadly they didn't.

A .pages document, just for those who are curious, is a zip file filled with some XML meta data and .iwa "iWork Archive" binary files which are compressed with "Zippy" - a Google originated compression format. So weird.

Web ARChive - WARC

Just to be confusing, there's another format out there called "web archive" as well. Did you know that the Internet Archive uses its own HTML bundling format - ISO 28500:2017 - which is supported by the United States Library of Congress?? I had no idea...

A WARC format file is the concatenation of one or more WARC records. A WARC record consists of a record header followed by a record content block and two newlines; the header has mandatory named fields that document the date, type, and length of the record and support the convenient retrieval of each harvested resource (file). There are eight types of WARC record: 'warcinfo', 'response', 'resource', 'request', 'metadata', 'revisit', 'conversion', and 'continuation'. The content blocks in a WARC file may contain resources in any format; examples include the binary image or audiovisual files that may be embedded or linked to in HTML pages.

It's not meant for an individual file as much as it's meant for entire websites. But in case you were ever wondering how archive.org stores snapshots of the entire Internet, now you know. Pretty cool.

Miscellaneous other formats

Compiled HTML - Microsoft created their proprietary .chm format in the 1990s for its Windows Help system to use. It's actually still supported in Windows 11, though the format hasn't been updated since the early 2000s.

WACZ - Web Archive Collection Zipped - is used by the WebRecorder project, which seems to be an active effort to create an open standard for web archiving, though you wouldn't realize it by the design of their website. I almost thought it was also a dead 1990s effort until I saw the August 2022 update.

SingleFile and SingleFileZ is an actively developed web extension for archiving web pages. It takes all the binary assets and converts them into data:// urls and then embeds them into the page automatically, so the end result is a (very large) plain-text .html file. Basically it's sort of like MIME, but using data urls. It works, but I wouldn't want to try to post-process the file.

Current efforts

So I wrote all the above relatively quickly, as I had a familiarity with most of the stuff I listed, or it wasn't too hard to look up and get familiar with the details. Then over a week has gone by while I tried to wrap my head around what's going on with the latest efforts by the W3C Web Incubator Community Group (WICG) and browsers, specifically the Chrome team. I can't promise I still have a handle on it, so if you see something wrong or see something missing, please let me know. There's a Github repo for the W3C Web Package community group which has an overview of what they're trying to do.

From what it looks like, it's a concerted effort to finally figure out a standard way to bundle up various bits of a web site or web app into a safe, secure single file archive for a few different use cases. The first is to save web pages, like the various solutions I wrote about above. Another is so that browser clients - especially on mobile - wouldn't have to request so many tiny files over the network. Another is that a standard archive could also be used to package web apps or whole web sites for installing or sharing offline. All well and good. However, concerns about privacy, safety and security has caused a lot of headaches, which they're trying to address in various ways. Here's their explainer. There was also an effort by the W3C Digital Publishing Interest Group to define the use cases for web publications. How all these groups inter-relate besides linking to each other in their docs, I have zero idea.

Side Note: The Concise Binary Object Representation - CBOR

There is a new, open-spec archiving file format called CBOR. It seems to underpin the web packaging efforts currently going on by the browsers and W3C. CBOR is basically like Apple's binary p-lists, except that it uses JSON to organize all the meta data and compressed binary data contained within the file, which can be addressed and extracted individually without decompressing the whole file. Signing and Encryption has been built in (though covered by another spec called COSE), so it's apparently good solution for a variety of problem domains, including messaging and apps.

Web Bundles - .wbn

So the latest proposed bundling format is called Web Bundles, which is still in draft stage and available behind a flag in Chrome. Though it nominally addresses the problems listed above, its announcement immediately got blowback because of the idea that web servers could potentially start to serve all their content using bundles. The problem being that if sites started only serving bundles, it would essentially break the web as we know it, because each of the individual files would be opaquely contained within the archive. This would mean that ad-blockers would stop working, and it could potentially have security and privacy issues as well, as who knows what sort of stuff could be contained inside the bundle. The guys at Brave were among many who lost their proverbial shit over the proposal.

The one and only post from Google about it was back in 2019... so I think the pushback may have been enough to stall this spec for a bit. As of right now, none of the other browser makers seem interested. Also, despite being available via a flag in Chrome, I've yet to get any of the .wbn bundles provided in that post to work on my computer.

This doesn't mean web bundles are dead and gone, however, as now it seems the focus has been changed from a general, all-purpose solution for bundling web assets, to being a way of bundling PWAs. This is definitely something that's needed, so I can't see it maybe getting a less harsh reception. Web Bundles may also contain "Isolated Web Apps" which are archived PWAs with strict CSP limitations which help with security.

Summary

I think that's about all the relevant specs, formats, and proposals for web archiving out there. This whole post started as an attempt to learn about what sort of solutions were available which would be a "standard" that could be used for a self-contained HTML Document, including CSS and Images, mostly. Like I thought in the beginning, how hard could it be to throw a bunch of files into a zip and call it a day? Apparently, quite a lot.

Sadly, not even the latest efforts really provide a solution, as they're focused on web apps and thus have to figure out how to make the bundles safe and secure, and as a result, aren't something that would be easily editable like an encapsulated document - ala .docx or .pages. The one format, ePub, might be a solution, but it's focus on books and its use of XHTML means that it's not suited at all for something easy and simple.

My next post will be about a proposal for an HTML Document format that I think could be implemented in browsers within a reasonable time, and some ideas as to how that might happen. You can probably guess from the above: It will be text-only, so a MIME-style solution with embedded base64 blobs similar to SingleFile's solution, plus a strict Content Security Policy ruleset which disallows scripting for safety and security. In order for this to work, it would need to have its own file extension - like .mhtml or .xhtml - so that browsers could automatically add those restrictions to HTML Documents opened from the file system.

I'll flesh it out in the next post.

-Russ