Hey, what's the best way to store HTML, XHTML and XML text in a database? In other words, now that we're all striving for XHTML correctness, should I validate what the user has written in a weblog, for example, before it gets to the db, or store whatever they come up with and fix it when it comes out?
The first is the "right" way to do it. I mean, users shouldn't have to worry about XHTML tags in the first place, but until we get decent WYSIWYG editors in our browsers, they pretty much have to. So a weblog done the first way would check the entry for validity and barf at the user if it wasn't perfect, with suggestions for corrections and, hell, fixing up their damn spelling and grammar while it was at it. Once the text is correct, it can be saved to a db - or even written directly to a web page forever.
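Here's roughly what the "barf at the user" check could look like. This is just a minimal sketch in Python using the standard library's XML parser. The function name `check_entry` and the dummy `<entry>` wrapper element are my own inventions, and it only tests whether the markup is well-formed, not whether it's valid XHTML against a DTD.

```python
import xml.etree.ElementTree as ET

def check_entry(fragment):
    """Return (True, None) if the fragment is well-formed, else (False, message)."""
    # Wrap the text in a dummy root element (a made-up name) so an entry
    # with several top-level paragraphs still parses as one document.
    wrapped = "<entry>" + fragment + "</entry>"
    try:
        ET.fromstring(wrapped)
    except ET.ParseError as err:
        # The parser's message includes a line and column, which is a
        # starting point for telling the user what to fix.
        return False, str(err)
    return True, None

ok, problem = check_entry("<p>Hello <b>world</p>")
if not ok:
    print("Please fix your entry:", problem)
```

One catch with this sketch: since no DTD is loaded, named entities like `&nbsp;` get rejected too, so a real version would need to handle those somehow.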
The second way is the quickest to do and the most user friendly. You just take whatever they write verbatim and throw it into the DB. Then you do some work and fix it up when it comes out. This sort of sucks for several reasons. First, the fixing is prone to errors. HTMLTidy is a nice utility, but sometimes it'll just give up on some text if it's completely screwed up and not do anything, so that means doing some real work and making a real effort programmatically to clean up the markup. Secondly, this means processing time every time the data comes out of the DB. Thirdly, you can't serialize the entry to plain text because you always have to clean it up before it's presented. I wrote about this method before - essentially, if you're cleaning up what the user writes every time, you might as well use any sort of markup you want, not just HTML.
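For the fix-it-on-the-way-out approach, here's a rough sketch of one possible fallback for the case where the cleanup gives up. It's not a real cleaner. It assumes a made-up `render_entry` function that passes well-formed entries through untouched and escapes everything else so the page itself doesn't break. The `lru_cache` is my own addition to take some of the sting out of redoing the work on every read.

```python
import html
import xml.etree.ElementTree as ET
from functools import lru_cache

@lru_cache(maxsize=1024)
def render_entry(raw):
    """Return well-formed markup for a raw entry pulled from the db."""
    try:
        # Same dummy-root trick: only checks that the tags nest properly.
        ET.fromstring("<entry>" + raw + "</entry>")
    except ET.ParseError:
        # Fallback (an assumption, not a repair): show the text literally
        # instead of sending broken markup to the browser.
        return "<p>" + html.escape(raw) + "</p>"
    return raw

print(render_entry("<p>fine</p>"))
print(render_entry("<b>never closed"))
```

The cache only lives as long as the process does, so this trades memory for time and nothing more.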
And that leads me to what I guess is a third way: do something like Textuality's wiki-like markup, which uses things like *stars* to represent bold, _underscores_ for underlines, etc. Again, you need to translate it every time it comes out of the db, because otherwise if the user wants to edit the text again, you need to "untextualize" the markup...
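To make the stars-and-underscores idea concrete, here's a toy translator. It's a sketch of the general technique, not Textuality's actual rules. The function name `render_light_markup` and the choice of `<strong>` and `<u>` tags are my assumptions. The raw text stays in the db as written and only gets translated on the way out, so there's nothing to "untextualize" when the user edits.

```python
import html
import re

def render_light_markup(text):
    """Translate *bold* and _underline_ shorthand into tags (toy rules)."""
    # Escape first, so any angle brackets the user typed show up as text.
    out = html.escape(text, quote=False)
    # *stars* become bold; the match never crosses a line break.
    out = re.sub(r"\*([^*\n]+)\*", r"<strong>\1</strong>", out)
    # _underscores_ become underline; \b keeps snake_case_words untouched.
    out = re.sub(r"\b_([^_\n]+)_\b", r"<u>\1</u>", out)
    return out

print(render_light_markup("This is *bold* and _underlined_ text"))
```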
What about storing both types of data? One copy for the original user-entered text and another for the cleaned-up version, to make sure you don't lose anything in the process? Seems like a waste of space.
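If you did store both, it might look something like this. It's a minimal sketch using SQLite from Python's standard library. The table name `entries`, its columns, and the helper functions are all made up for illustration, and `cleaner` stands in for whatever cleanup step you'd actually run.

```python
import html
import sqlite3

def open_db(path=":memory:"):
    """Open the db and make sure the (hypothetical) entries table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entries ("
        " id INTEGER PRIMARY KEY,"
        " raw_text TEXT NOT NULL,"      # exactly what the user typed
        " clean_html TEXT NOT NULL)"    # cleaned once, at save time
    )
    return conn

def save_entry(conn, raw_text, cleaner):
    """Store the original text next to its cleaned-up version."""
    cur = conn.execute(
        "INSERT INTO entries (raw_text, clean_html) VALUES (?, ?)",
        (raw_text, cleaner(raw_text)),
    )
    conn.commit()
    return cur.lastrowid

def load_for_display(conn, entry_id):
    row = conn.execute(
        "SELECT clean_html FROM entries WHERE id = ?", (entry_id,)
    ).fetchone()
    return row[0] if row else None

def load_for_editing(conn, entry_id):
    row = conn.execute(
        "SELECT raw_text FROM entries WHERE id = ?", (entry_id,)
    ).fetchone()
    return row[0] if row else None

conn = open_db()
entry_id = save_entry(conn, "a < b", html.escape)  # html.escape as a stand-in cleaner
print(load_for_display(conn, entry_id))
print(load_for_editing(conn, entry_id))
```

The cleanup cost is paid once at save time, and the edit form always gets back what the user originally typed. The price is the second column.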
Though validating the data before saving seems like the most obvious solution, it's not once you start thinking about the various ways to enter data. What happens if you're moblogging and you send off an email to post with some markup that's bad? Do I 1) ignore it, 2) clean it up automagically, even though it might screw up the post, or 3) send back an email with corrections, requiring the user to send yet another email to post (repeat ad nauseam)?
It seems like a simple thing, but it's a real question.