What's with all the XML hate? Of course, doing everything in XML is a stupid idea (e.g. XSLT and Ant) and thank heavens that hype is over.
But if I want something that is able to express data structures customized by myself, usually with hierarchical data that can be verified for validity and syntax (XML Schemas or old-school DTD), what other options are there?
Doing hierarchical data in SQL is a bitch and if you want to transfer it, well good luck with a SQL dump. JSON and other lightweight markup languages fail the verification requirement.
XML is unnecessarily verbose, for the supposed sake of human readability. But used as a serialization format, it isn't really readable or editable by humans (except in the sense that a Turing machine is programmable): remember that the ML in XML stands for "markup language", and SGML, its predecessor, was designed as a way of marking up normal text, not littering data with angular brackets and identifiers. (XML/SGML arguably isn't that hot as a markup language, either.)
If you really need a hierarchical serialization format that is "verified for validity and syntax", the problem is that XML has prevented the adoption of something better (because it was "good enough").
If you don't need that, then XML is overkill and bloat and makes your format less readable than it could be. And you rarely need it, because either your data is computer-generated and -read, so there's little point in putting in extra schema checks, or schema verification is woefully insufficient (because it can't verify the contents of fields, relations between fields, or a ton of other stuff that can accidentally go wrong).
> But if I want something that is able to express data structures customized by myself, usually with hierarchical data that can be verified for validity and syntax (XML Schemas or old-school DTD), what other options are there?
He actually did address my question in a way: "[...] XML has prevented the adoption of something better (because it was "good enough")."
Which IMO is a sensible way of looking at it. I too think XML is not perfect, but if all the other stuff we're currently stuck with were as "good enough" as XML, IT would be a place with fewer WTFs all around. ;-)
That depends on your specific goals. But essentially, XML schemas are sort of like attribute grammars, except with an unnecessarily convoluted syntax, and yet more limited in their expressiveness than attribute grammars (because whatever constraints you need have to be procrusteanized into XML schemas).
Even if you were to stick with XML semantics as is, you could improve the syntax to be actually readable and eliminate the angle bracket tax [1, 2].
Alternatively Carl Sassenrath was pushing Rebol in the past. See his blog post "Was XML Flawed from the Start?" - http://www.rebol.com/article/0108.html
If used sensibly XML isn't too bad. But there's a whole lot of cruft in the standard that seems to do nothing except make it harder to use. Part of this is a problem with popular libraries rather than inherent to the format, but we judge a thing by its ecosystem rather than in isolation. So:

- Namespaces are a pain, making it much harder than it should be to just make my xpath work.
- DTDs are annoying, especially when a production system breaks because a remote server that was hosting a DTD goes down, so now your parser refuses to load a file.
- User-defined entities seem pointless, and though most parsers can handle the billion laughs attack these days, it wasn't always so.
- The handling of text nodes is confusing; whitespace is irrelevant except when it isn't.
- Specifying the encoding inside the document itself seems wrong, and supporting multiple encodings at all causes trouble (e.g. sometimes it's simply impossible to include one document in another inline).
Is XML schema really so much better than e.g. JSON schema?
To me it feels like there's an impedance mismatch between the kind of structures XML lends itself to and the kind of structures programs are good at dealing with. So for program-to-program communications with a certain level of validation I find Protocol Buffers is a much better fit. Conversely in cases where human readability is really important, XML isn't good enough compared to JSON.
> So: namespaces are a pain, making it much harder than it should be to just make my xpath work.
Namespaces exist to solve a real-world problem that happens in real-world use cases (SVG embedded in HTML, HTML embedded in RSS). While it would be nice to look at things that are complex and say "it would be less complex for these trivial cases without this feature", in reality there are then common use cases that become more complex or even impossible in the general case, which seems like a very short-sighted benefit. Namespace prefixes are really not that difficult to configure, and once configured XPath makes them very easy to use :/.
The biggest caveat with namespaces is that most people have never bothered figuring out how they work. The number of applications I've seen that have hardcoded namespace prefixes instead of looking up the namespace URI, for example, is horrifying.
Namespace prefixes are not that difficult to configure once you know about them. But if you're just starting with XML, probably because you need to extract some information from a document you've been sent, you don't want to learn the theory of XML, you want to get the data you need out and get on with adding business value. So you find a tutorial, you write an xpath, and it doesn't work. You try removing the foo: prefixes in your xpath, and it still doesn't work. This is not the experience that a technology should give new users. A default of matching ignoring namespaces would not make anything impossible.
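The new-user trap described above is easy to reproduce. A minimal Python sketch (the element names and namespace URI are made up for illustration), using the standard library's ElementTree:

```python
import xml.etree.ElementTree as ET

doc = """<root xmlns:f="http://example.com/foo">
  <f:item>hello</f:item>
</root>"""

root = ET.fromstring(doc)

# The "obvious" path finds nothing, because internally the element's
# name is the Clark notation {http://example.com/foo}item:
print(root.find("item"))  # None

# Supplying a prefix-to-URI map makes the path work; the prefix chosen
# here does not even have to match the one used in the document:
ns = {"foo": "http://example.com/foo"}
print(root.find("foo:item", ns).text)
```

Once you know about the namespaces argument it really is easy, but nothing in the failure mode points a newcomer toward it.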
Indeed. XML gets a lot of hate because it's so difficult to use. It would be fine if you could use it without having to care about the 100 features you don't care about and just use the ones you need, but pretty much every library I've seen makes parsing (or generating) a document a huge and complicated task, and most of it is completely irrelevant to the problem I'm trying to solve.
And because of this almost no-one bothers to actually handle it properly so you often can't actually use the advanced features even if you wanted to.
This varies greatly from framework to framework, and language to language. On the JVM at least, the dark machinery that handles the XML is rather rigorously correct. Parsing and generation are trivial, especially using JAXP. You have multiple ways of working with XML (objects, DOM, push, pull).
XML is "good enough" for a lot of cases. There are lots of tools to mess around with it too, which is really quite valuable when you're experimenting with various kinds of data or you're debugging. Being able to extract out stuff you're interested in XML format means you can perform a lot of complex manipulations quite easily.
The issue is probably that 99.999% of all XML use cases don't use (or need) the verification aspect. For all of those, XML is overkill. Besides, surely it would be possible to design a verification layer on top of JSON, for instance - the fact that one does not currently exist does not mean that XML (and abuse of XML!) should not be criticized.
One of the core aspects of XML that is really important is that, unlike JSON, no typing is inferred from the structure of the file. JSON is by nature tied to the JavaScript type system, which is sparse and inaccurate. For example, if you look at the following:
{ "name": "bob", "salary": 1e999 }
Ah crap! The deserializer blew up (in most cases silently converting the number to null).
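The failure mode varies by implementation. A quick sketch of what CPython's own json module does with that document (other parsers may null it out or reject it instead):

```python
import json
from decimal import Decimal

raw = '{ "name": "bob", "salary": 1e999 }'

# CPython's json module accepts this, but the value silently degrades
# to an IEEE-754 infinity -- the actual salary is gone:
print(json.loads(raw)["salary"])  # inf

# A parser-specific workaround (not part of the JSON spec): route
# number parsing through Decimal instead of float:
print(json.loads(raw, parse_float=Decimal)["salary"])
```

So the data survives only if you know, ahead of time, to reach for an implementation-specific hook.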
I think it's refreshing to hear someone advocate XML instead of JSON, specifically because you bring up a good point.
The problem, I think, is that just because XML is human-readable doesn't mean it's human-writable (I'm looking at you, Maven!). I believe this is the root cause of why many people hate XML, even though it has a very sweet spot in application-to-application communication.
If you take the brackets and the closing tags out (use meaningful whitespace) it's a hell of an improvement [1]. A format I really like (OK, it's aimed at HTML, not XML) is the Slim templating language [2]. It manages to pack the same information in but is massively more readable.
Yeah this is exactly where my hate towards Maven configuration comes from, but it's more a testimonial of a bad fit for configuration files than critique towards XML. Java enterprise application configuration has the tendency to be very "expert-friendly", and this is where XML got its bad name from.
> Ah crap! Deserializer blew (in most cases silently converting the number to null)
Right -- the parser blew it. That many implementations do this is frustrating (and caused me so many problems that I ended up building my own validator for problems like this: http://mattfenwick.github.io/Miscue-js/).
JSON doesn't set limits on number size. From RFC 4627:
An implementation may set limits on the range of numbers.
It's the implementation's fault if the number is silently converted to null.
I guess we need better implementations!
> JSON is a popular format but it's awful.
If you're willing to take the time to share, I'd love to hear more examples of JSON's problems. I'm collecting examples of problems, which I will then check for in my validator!
If you're looking for examples of problems, RFC7159 (http://rfc7159.net/rfc7159) is a good place to start - just search for 'interop', as suggested by [1]. A quick look at Miscue-js suggests you already check for most of them, but you might still find something new.
Your example doesn't do anything but make XML look as bad as you're saying JSON is. Think about it again - do you think your first XML example doesn't ALSO have to be deserialized twice (once into an in-memory XML tree, once into a number)? It does. Also, both examples will fail if you try to deserialize either of them into numbers...
Regardless, JSON is so much more readable that I'm very glad it's pushed XML out of the picture for the most part.
XML can be read as a stream: at certain points, such as after reading an element or attribute, an object can be created on the fly, or a property set on an object, with the type deserialized at the same time. The types don't have to be native types either; they can be complex or aggregate types, such as any numeric abstraction or date type you desire.
See javax.xml.stream (Java) and System.Xml (CLR) for example.
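The same streaming-with-typed-deserialization idea, sketched in Python rather than the Java/CLR APIs named above (the document and the choice of Decimal as the numeric abstraction are illustrative assumptions):

```python
import io
import xml.etree.ElementTree as ET
from decimal import Decimal

doc = b"<payroll><salary>1e999</salary><salary>42.5</salary></payroll>"

# Stream the document and deserialize each value into whatever numeric
# abstraction we want, the moment its end tag arrives:
salaries = []
for event, elem in ET.iterparse(io.BytesIO(doc), events=("end",)):
    if elem.tag == "salary":
        salaries.append(Decimal(elem.text))

print(salaries)
```

Note that 1e999, which overflows a double, round-trips without loss because the application (not the wire format) picked the type.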
As for readability, some XML is bad which is probably what you've seen but there's plenty that's well designed.
XML is afflicted with piles of criticism which usually comes from poor understanding or looking at machine targeted schemas that humans don't care about.
You'd complain the same if you looked at protobufs over the wire with a hex editor.
What is that massive semantic difference? If you want the number represented by 1e999 as the value for salary, at some point, something has to take "1e999", whether you call it a string or a something-with-no-type, and turn it into a number. Your deserializer has to know to do that in either case.
How does the [deserializer] step in the XML example know to call into [bignum], and why can't the [json reader] in the JSON example have that knowledge in the same fashion?
Because the XML document has a semantic meaning that is specifically designed for this application. It may even have a schema definition document which formally defines what types to expect. JSON, by contrast, has type definitions imposed on it by its nature as JavaScript code.
I've sort of lost track of what this debate is about... Assuming you don't have a schema definition, it seems to me that you can just as easily parse `{ "salary": "1e999" }` with application-encoded semantics as `<salary>1e999</salary>` with (again) application-encoded semantics. Maybe having a formal schema definition is a win, though.
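That equivalence is easy to demonstrate. A small sketch (field names invented for illustration) showing that, absent a schema, both formats hand the application a string and let it apply its own semantics:

```python
import json
import xml.etree.ElementTree as ET
from decimal import Decimal

# Same application-encoded semantics, either serialization:
j = json.loads('{ "salary": "1e999" }')
x = ET.fromstring("<salary>1e999</salary>")

# In both cases the application decides the string is a big decimal:
print(Decimal(j["salary"]) == Decimal(x.text))  # True
```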
iff you have a schema, and a parser that actually uses it. I've seen a few DTDs but the vast majority of XML documents don't have a schema or even a DTD to follow.
And the vast majority of parsers will not parse anything for you, regardless of schema definitions.
Which effectively puts you in the same place as the JSON string.
Either the author of the serialized data realized that the numbers could overflow a float, or he didn't. This is independent of serialization format.
In your contrived example, somehow, the user of JSON didn't realize the salary could overflow a float. (OTOH, he succeeded in serializing it, mysteriously.) All the while, the XML user was magically forward thinking and deserialized the value into a big decimal. Your argument simply hinges on making one programmer smarter than the other. If one knows that a value will not fit a float, the memory representation won't be a float and the serialization format won't use a float representation. It has nothing to do with JSON vs XML.
This. Types are a huge pain in JSON, particularly the lack of a good date-time type. BSON fixes this, but only if you're using MongoDB and are willing to give up the "human readable" requirement outside of Mongo.
OK, so the provided number format is not sufficient for the kind of numbers he is trying to deal with. So instead you would represent it as a string and handle the encoding/decoding of that number yourself. How is that different from the XML way where there is no provided number format to begin with, and everything is a string?
People seem to prefer JSON, but I don't find it any better to hand-write/hand-edit than XML. If anything it's slightly worse, because it has more syntax edge cases.
And it doesn't support the multitude of accurate numeric types that XML does implicitly. XML data is not just "strings", it's a sequence of characters. The deserializer determines what sort of type it is based on either the structure or the language's capabilities. With XML, you can define these policies. With JSON you're stuck with JavaScript being the semantic standard and type definitions which ties you to floats or numbers inside strings. The latter is criminal.
Edit: clarification as HN won't let me reply any more.
How so? XML by itself only supports strings; any other data types have to be derived from a schema. But you can do the same with any other format that supports strings, including JSON.
But in the design of XML this was already acknowledged.
That's why there is the distinction between well-formed and valid XML documents. Only valid XML documents have a schema attached that describes the types of these nodes. And because it is extensible, these types can be anything, yet they will still be automatically validated by the parser.
JSON OTOH doesn't have this extensibility. There are a couple of predefined types but if you need to go beyond them (and this happens all the time because JSON doesn't even define a date type!) any interpretation is up to the parsing program and this can vary tremendously (again, look at the handling of dates and for example the questions on stackoverflow about them).
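The missing-date-type problem is easy to show concretely. A Python sketch (the field name and the ISO 8601 convention are illustrative choices, not anything the JSON spec mandates):

```python
import json
from datetime import date

# JSON defines no date type, so the producer must pick a convention --
# here, ISO 8601 strings via a custom encoder hook:
def encode(obj):
    if isinstance(obj, date):
        return obj.isoformat()
    raise TypeError(obj)

raw = json.dumps({"hired": date(2015, 3, 1)}, default=encode)
print(raw)  # {"hired": "2015-03-01"}

# Nothing in the document marks "2015-03-01" as a date rather than a
# plain string; the consumer just has to know, out of band, to parse it:
parsed = date.fromisoformat(json.loads(raw)["hired"])
print(parsed == date(2015, 3, 1))  # True
```

Another producer might equally well have emitted a Unix timestamp, which is exactly the interoperability mess the Stack Overflow date questions are about.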
To be specific, JSON syntax is a subset of YAML version 1.2.
However, I hate YAML with a passion. It is worse than XML in my books. I can usually read JSON fine. I can also read XML in many cases. For the life of me, I just can't read YAML. It has something to do with "-", line indentation and different ways of writing lists.
Of course, someone will say YAML is technically better ...
Same here, it is very difficult for me to tell levels of nested structures in yaml. Though I'm sure if I sat down and read up on it I could force it into my brain. But shouldn't it be intuitive to read without that?
Python has exactly the same problem -- control-structure nesting quickly gets confusing and hard to read beyond a certain (fairly small) size -- but at least with python, you have the option of splitting off stuff into separate functions to limit the amount of nesting and size of blocks.
It's technically true because YAML includes an alternate "inline style" that lets you write objects in JSON syntax. Therefore any JSON object is a valid YAML object as well. But, not an idiomatically written YAML object, since writing YAML using only inline style is unusual.
Could you provide examples? I'm trying to collect more examples for a JSON validator -- http://mattfenwick.github.io/Miscue-js/ (built during a big project using JSON, after I started running into some issues that I couldn't check using other validators)
I'd love to hear more examples if you're willing to share.
I personally miss having schemas and XSLT in JSON.
> doing everything in XML is a stupid idea (e.g. XSLT and Ant)
XSLT actually made a lot of sense. If everyone writes code to transform format1 to format2, then what you end up with is a lot of slightly different transformations. Its main downfall, just like XML itself, was that it was annoying and time-consuming to write.
How would you replace all this if you moved away from XML?
> Its main downfall, just like XML itself, was that it was annoying and time consuming to write.
And impossible to debug. Write it once, do something else for a few weeks, and trying to understand what you were doing is nearly impossible.
There are schemas in JSON (see, for example, Kwalify) although they are not something that is built into the specification. I don't think the equivalent of XSLT is as necessary when the document readily translates to data structures in a scripting language.
The problem with XSD and DTD is they only offer primitive ways to validate data, and it takes significant effort to validate some data (eg,[ https://stackoverflow.com/questions/3382944 ]). As a result, there have been a bunch of other XML schema validators created to counter these problems, but we should really ask why we need to keep inventing new languages when the existing ones turn out to be insufficient.
If we start out instead with something that's turing complete and simple to begin with (perhaps S-expressions?), we can (often trivially) write our own validators/type-checkers, or any other processing tool to verify the document structure, with few or no constraints, and without requiring the effort and expertise to parse complex syntax.
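To make the parent's point concrete, here is a toy sketch (in Python, with a made-up document shape): the whole S-expression reader fits in a dozen lines, and the "schema check" is just ordinary code.

```python
def tokenize(s):
    # Pad parens with spaces so split() yields one token per atom/paren.
    return s.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens):
    tok = tokens.pop(0)
    if tok == "(":
        lst = []
        while tokens[0] != ")":
            lst.append(parse(tokens))
        tokens.pop(0)  # discard the closing ")"
        return lst
    return tok

doc = parse(tokenize("(person (name bob) (salary 1e999))"))

# A "validator" is then just a function -- as expressive as the
# host language, no schema language required:
def valid_person(node):
    return (isinstance(node, list) and node and node[0] == "person"
            and any(f[0] == "name" for f in node[1:])
            and any(f[0] == "salary" for f in node[1:]))

print(valid_person(doc))  # True
```

Obviously a real reader needs error handling and string literals, but the point stands: the constraint language is the programming language itself.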
Unfortunately, XML is way too open-ended for my tastes. You end up getting entire rows of DB content (with full text paragraphs and everything) entirely in one tag, with attributes and values. There are so many options that you typically get a lot of idiot programmers who don't understand the purpose of all the shit in XML, so they fuck up their implementation.
Simply put, XML does not correctly model the data we intend to interchange. It was a noble effort, but it didn't come from a place of innovation. It came from corporate needs for standardization.
This may seem innocuous, but XML allows mixing of arrays and objects too liberally, and makes automatic parsing overly complex. At first <customer> appears to be an array of account objects, but wait now that we reach the end we find that <customer> is an object with multiple keys and must create an unnamed array key to hold accounts.
XML is a document markup language, not a data format.
The really annoying issue is as the parent says, that the accounts collection does not have a name. This means there's no canonical mapping for the structure into a programming language object, which necessitates that libraries require annotations or some other side-channel way of specifying how to wrap the accounts into a collection.
In Jaxb e.g., how many times must we add junk like:
@XmlElementWrapper(name = "accounts") ?
In any individual case the workaround is easy, but it's annoying to have to do it repeatedly.
XML really is better as document markup than structured data representation.
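The unnamed-collection problem shows up outside JAXB too. A Python sketch of the <customer>/accounts shape discussed above (element names assumed for illustration):

```python
import xml.etree.ElementTree as ET

# The repeated <account> children have no wrapper element naming the
# collection, so a generic XML-to-object mapper cannot tell "this is a
# list" from "these are sibling fields" until the whole element is read:
doc = ET.fromstring(
    "<customer>"
    "<account>123</account>"
    "<account>456</account>"
    "<name>bob</name>"
    "</customer>"
)

# The list has to be invented by the consumer, by name, out of band:
accounts = [a.text for a in doc.findall("account")]
print(accounts)  # ['123', '456']
```

That per-consumer "invent the wrapper yourself" step is exactly what annotations like @XmlElementWrapper paper over.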
People dislike XML because it's way overkill for 99% of people's use cases, but it still gets used anyway! Most people who use it should have been using something simpler like JSON to create their configuration file or return their list of strings in some HTTP API. You can have bloody security vulnerabilities with XML, like you had recently with Facebook: https://www.facebook.com/BugBounty/posts/778897822124446
The likelihood of a JSON feature biting you in the ass like that is far lower. Don't use XML until you actually need something XML SPECIFICALLY provides.
Also, JSON translates readily into easy-to-work-with dictionaries and lists; XML parsers take more code to work with the equivalent structures.
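A side-by-side sketch of that difference (toy data, Python's standard library for both):

```python
import json
import xml.etree.ElementTree as ET

# JSON maps straight onto the language's native structures...
data = json.loads('{"users": ["ann", "bob"]}')
print(data["users"][1])  # bob

# ...while the XML equivalent needs explicit tree traversal, and the
# consumer must decide for itself that the <user> siblings form a list:
doc = ET.fromstring("<users><user>ann</user><user>bob</user></users>")
print(doc.findall("user")[1].text)  # bob
```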
> But if I want something that is able to express data structures customized by myself, usually with hierarchical data that can be verified for validity and syntax (XML Schemas or old-school DTD), what other options are there?
S-expressions work great. Syntax checking is far simpler, and validity checking is hence something you can roll yourself (and writing an S-expression schema checker ain't tough).
I haven't used it myself, but lisp seems well suited to the task. I've also heard good things about yaml, which is more well-supported by your language of choice.
Because XML solves a 'problem' in the worst possible way. It is not that easy to parse for machines and only the simplest XML files are readable by humans.
Besides, since 1960 or thereabouts we have S-Expressions. The world should just have used that without reinventing the wheel once again.