My first target for Summer of Code is simple: to find a tool for extracting differences between two Pathways. Pathways are stored in gpml, an XML-based format. So my first idea was to look for an XmlDiff utility suitable for use in wikipathways. I spent some time reviewing all the various XmlDiff utilities out there. It turns out there are quite a few. And you know, if there are a dozen equivalent solutions for a certain problem, then you can bet that probably none of them are going to work for you.
So while reading more into what makes a good XmlDiff tool, I found some interesting whitepapers that caused me to rethink the problem. I now believe it would not be hard to write my own version of XmlDiff that is completely optimized for my specific problem, namely finding differences between Pathways.
With plain text, the differencing problem was solved nicely a long time ago with the famous diff utility. Basically it takes a text file line by line and considers if a line was either inserted, deleted or just changed. For this it implements the Longest Common Subsequence (LCS) algorithm.
(Incidentally, I learned that the LCS algorithm is exactly the same as the Needleman Wunsch algorithm, well known in Bioinformatics as an old solution to the Sequence Alignment problem (see image). Clearly, Nature edits DNA Sequences the same way as we edit text files: Insertions, Deletions, Mutations).
Now although XML is also text, it has a neat hierarchical structure that we would like to make use of. The line opeartions for diff don’t make much sense for XML. LCS also doesn’t cut it, so I had to read up on stuff like Nodeset correspondence and Tree correction algorithms.
So here is my main problem with all these XmlDiff tools: no two XML formats are the same. Gpml is not SVG, which is not XHTML. The idea behind XML is that you can use a standard set of tools for any type of document because they are structured in the same way, but in this case I think we’re better off with seperate tools for each type of document. Gpml is much simpler than the superset of all of XML. You can think of a whole bunch of operations that make sense on certain XML documents but not on Gpml:
- Move a subtree up or down in the hierarchy (In gpml, all tags always live at the same level. <Pathway> is always at the root, <DataNode> is always just below <Pathway>)
- Change the order of elements (In gpml, the order of elements does not matter)
- Move a subelement of <DataNode> to a different <DataNode>. (Technically this could happen, but it just doesn’t make any sense)
These editing operations just don’t make any sense for Pathways. A generic XmlDiff is unnecessarily complex, and has output that is hard to interpret.
So, GpmlDiff instead of XmlDiff. A good decision, or just a case of
not-invented-here syndrome? Let me know what you think 🙂
I’m already making good progress with the implementation of GpmlDiff. I only need to decide how to format the output. The only available standard is XUpdate, but that seems to assume ordered Xml. Suggestions are welcome too!