Note: a much more comprehensive history can be gleaned from the Chronology pages (which, however, also contain all sorts of additional random information and opinions of the author) and, since the release of SiSU as Software Libre under the GPL, from the document changelog.
While working with legal texts in an academic environment, on a site that was first called Ananse, then The International Trade Law Monitor and later still Lex Mercatoria, 263 I was faced with a number of issues, those of interest here being technical. Amongst them was the relatively fast evolution of html (in which text was prepared for the Web), which made it cumbersome to have to continually update text/document representations to reflect the improvements in what was possible with the latest html markup. There was also the fact that some of the strengths of html were limitations in other document representational contexts, e.g. good document rendition across multiple screens was a different problem from ideal paper rendition. Also, within an academic and law environment, one limit of html that repeatedly presented as critical for academic writing was that it was not possible to reliably cite the location of content within a document. HTML rendered differently in different browsers; change the font size and it again came out differently. This led to work on figuring out how these limitations could be overcome, which resulted, amongst other things, in the early development of the object number system, which could be used independently of page numbers to locate text.
The use case came to be scholarly writings in law and literature, and conventions, and it proved useful across writings in literature, the humanities and law, and a smaller section of the social sciences.
SiSU came to be through a series of steps that started from seeking to overcome these problems. The first was the recognition that multiple document format types could be generated (and technically updated as need be) from a single, lightly structured, prepared source text/document, and that these multiple output formats could share a common numbering system for referencing text within a document. Further, to achieve this, text could usefully be represented as individual objects identified by these object numbers, and these objects could be the building blocks from which the alternative document representations and formats were built. This made it possible to take advantage of many of the distinct native strengths of the primary standard ways in existence for the convenient representation or extraction of text, each idealised for a different context, amongst them html, XML, ODF, LaTeX, pdf and (SQL type) relational databases.
The requirement was minimal effort (in the form of preparation and maintenance) relative to payoff as regards the described objectives. The idea was to have a document structure meta-markup that required as little effort as possible, both initially and over time (it should be possible to develop, i.e. change or add, output formats without having to think about the original source document); that was able, to the greatest extent possible, to take advantage of as many as possible of the most interesting features available in each of the most important standard document representational methods, viz. html, XML, ODF, LaTeX, PDF and SQL type relational databases, from that common prepared document source; and that resulted in a meaningful common way of identifying text content.
This resulted in: (a) a minimalist/light structured markup from which the primary benefits of multiple document representation types could be generated, 264 keeping markup/preparation relatively minimalist and easy to remember, and independent of the development/evolution of document output representations, in order to keep document preparation effort to a minimum, both initially and with regard to maintenance over time; (b) an abstraction layer for the representation of the document, generated independently of the prepared source, which represented text as numbered objects that could be utilised in any of the final document output representational forms in a shared/common/similar way for the location of content within a document. 265 Separating markup from abstraction and subsequent outputs meant that the markup syntax and the underlying output generating modules could be developed/evolved independently of each other. You could arbitrarily change the markup syntax (or have alternative preparation syntaxes) provided you could generate the abstraction layer, from which subsequent outputs would result. Or you could change the abstraction layer and related output generation modules whilst retaining the markup syntax.
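The separation described above can be illustrated with a minimal Ruby sketch. It is not SiSU's code or markup: the "# " heading prefix, the TextObject struct and the method names abstract, to_html and to_plain are all assumptions made up for this illustration. The point is only that the source is first turned into numbered objects, and that each output generator sees nothing but those objects, so every output shares the same object numbers.

```ruby
require 'cgi'

# One addressable unit of text; `ocn` is the object number shared by all outputs.
# (Struct and field names are hypothetical, chosen for this sketch.)
TextObject = Struct.new(:ocn, :kind, :text)

# Build the abstraction: split the source on blank lines and number each block.
# The "# " heading prefix is a made-up convention, not SiSU markup.
def abstract(source)
  blocks = source.split(/\n\s*\n/).map(&:strip).reject(&:empty?)
  blocks.each_with_index.map do |block, i|
    if block.start_with?('# ')
      TextObject.new(i + 1, :heading, block.sub(/\A#\s+/, ''))
    else
      TextObject.new(i + 1, :para, block.gsub(/\s*\n\s*/, ' '))
    end
  end
end

# Output generators work only from the abstraction, never from the source.
def to_html(objects)
  objects.map do |o|
    tag = o.kind == :heading ? 'h1' : 'p'
    %(<#{tag} id="o#{o.ocn}">#{CGI.escapeHTML(o.text)}</#{tag}>)
  end.join("\n")
end

def to_plain(objects)
  objects.map { |o| "[#{o.ocn}] #{o.text}" }.join("\n\n")
end

source = <<~SRC
  # On Citation

  Page numbers depend on the rendering.

  Object numbers
  do not.
SRC

objects = abstract(source)
puts to_html(objects)
puts
puts to_plain(objects)
```

In this sketch, changing the source syntax only touches abstract, and adding an output format only means adding another method that reads the objects, which is the independence the paragraph above describes.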
The first technical work that in any way relates to the way SiSU works dates back to earlyish in the history of the site Lex Mercatoria, which was at the time called Ananse (and later the International Trade Law Project and then International Trade Law Monitor). Looking for more convenient ways to manage site content while at the University of Tromso, I had a young student, Tommy Johansen, look at it over a summer. I (and Geofrey Armstrong) at the time gathered content for the site. Tommy Johansen wrote some Perl scripts for generating html content, which were used early in the site's history and which were convenient in particular for: (a) producing uniform output, (b) separating code from markup, (c) their ability to produce tables of content, (d) the possibility of matching text in a header to segment text (not yet regular expressions). After Tommy Johansen left, his scripts were used pretty much unchanged for a good while, and though this was before text objects, object numbers, document abstraction, or any document representation other than html, these were features that were retained by what was to become SiSU.
In 1997/1998 object numbers were introduced to html output, overcoming the problem of the precise location of text within a fixed/published html document. The possibility of using text objects (and object numbers) for other forms of output was conceived around the same time as the introduction of object numbers to html, as it was clear that this system should have wider use across different types of output. 266
In 1999 I was switching from Windows to GNU/Linux... first Red Hat then SuSE. 267 As far as SiSU was concerned, the program was written in Perl and relatively easy to port. 268
In 2000 I was switching from Perl to Ruby... well, that was the end of 2000, November (Dave Thomas' book, which I had been waiting for since the beginning of the year, was published at last, and I finally received my copy). 269
By June 2001 SiSU was generating LaTeX output that was converted to both portrait and landscape pdf that shared the same object numbers as the html output.
In May 2002, tired of waiting for the version dubbed Woody, I was switching to Debian... 270
SiSU search was finally actually implemented in 2002, 271 in the form of the database structure that made object search possible and the ability to populate the database with objects with corresponding object numbers from the same document source as other output formats. I did not have much of an immediate incentive to implement search as I did not have an online database. However, having an implementation and showing it around was the reason for the initial opening of these pages and placing a description of what SiSU did on the Net in November 2002, at ‹http://www.jus.uio.no/sisu› (updated regularly if haphazardly 272 since), together with a pdf chart/diagram that included the relational database aspect as a feature, which should still be available at ‹http://www.jus.uio.no/sisu/diagram/sisu.chart.pdf› (prepared in 2002). 273
Concordance files, first called "wordmaps", were introduced the same year, 2002. The search front-end has continued to evolve, and screen-shots of it were made in 2004.
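As an illustration of what object search and a concordance have in common, here is a small in-memory Ruby sketch. It is an assumption-laden stand-in: SiSU populated a relational database, whereas this uses a plain Hash, and the names build_concordance and search, as well as the sample documents, are invented for the example. What it shows is that a match is reported as a document plus an object number rather than as a document alone.

```ruby
# Map each word to the [document, object number] pairs in which it occurs.
# `documents` is a hypothetical structure: name => ordered list of text objects.
def build_concordance(documents)
  index = Hash.new { |hash, word| hash[word] = [] }
  documents.each do |name, objects|
    objects.each_with_index do |text, i|
      # Record each distinct word once per object; object numbers start at 1.
      text.downcase.scan(/[[:alnum:]]+/).uniq.each do |word|
        index[word] << [name, i + 1]
      end
    end
  end
  index
end

# Look a word up; unknown words give an empty list.
def search(index, word)
  index.fetch(word.downcase, [])
end

documents = {
  'treaty' => ['Scope of application', 'Formation of the contract'],
  'essay'  => ['The contract as code']
}

index = build_concordance(documents)
search(index, 'contract').each do |name, ocn|
  puts "#{name}, object #{ocn}"
end
```

Because every output format carries the same object numbers, a hit such as "treaty, object 2" can be followed into the html, pdf or any other rendition of that document.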
In June 2004 an IBM software innovations evaluator (at first reluctantly) met me. He was busy at the time, and the contact had been arranged through an IBM manager met at a Linux show, who was curious about what a lawyer was doing with Linux and programming; he asked "what is it you are doing?" and said "we [IBM] should have a look at it". Anyhow, the software innovations evaluator had a look at SiSU and gave it a very positive/enthusiastic review (so naturally I thought he was great). This was not a code review, mind; it was a "review"/reaction based on what SiSU did and how it did it, and the implications of it all ... what it meant could be done. To paraphrase, he said:
We have large document management systems. We can search over a hundred thousand documents and tell you that your search criteria are met by say 300 of them, but there is no way we can tell you, without going into each document, where those matches are... once you open a document we can highlight matches.
He wrote a letter I kept and published as a souvenir.
"Ralph Good to meet with you today, I was very impressed with your software.
[colleague's name] - in summary - Ralph has built an application that runs on linux and takes ASCII documents and pulls them apart in to the smallest constituent parts, storing them as XML, PDF and HTML, the HTML are hyperlinked up so the document can be browsed in its full form. the format and text data created is stored in a database.
This has potential in any place that needs the power of full text search whilst holding the structural concepts of the document i.e. legal, pharma, education, research.. which ones we need to figure out, ..."
He suggested I get a software patent. I reluctantly agreed to investigate (that story is told elsewhere).
Subsequent meetings with IBM were odd ;-) 274
Well the person who arranged the original meeting with the "software innovations evaluator", did say that IBM was such a large organisation that different groups were working on different projects and had different interests, and frequently it was a question of meeting the right people; and that there usually were multiple entry points which could be quite different in their interests and responses. Interesting encounters, entertaining mail.
I was an example of a prime beneficiary of Software Libre, and one who had come to understand/know (believe if you prefer) through use that it was technically superior to proprietary software.
In January 2005 SiSU was first released under the GPL.
May 2005 saw the first Debian packages for SiSU. I had visited Wookey earlier in the year as a shortcut to building my first Debian package.
In July 2005 at Debconf5, Helsinki, 275 SiSU was first uploaded into Debian, by Gunnar Wolf.
At Debconf5 after talking to various people, it was clarified to me that generating hash sums was a fast and not particularly memory intensive process, so the decision was made to incorporate md5 or optionally sha256 hash sums into the document abstraction representation, as this makes possible several additional/alternative forms of document representation that rely on the hashes for unique identification of objects (also across document collections). Document Content Certificates were introduced shortly afterwards that make use of the hash sums to identify objects - headings, paragraphs, footnotes, images etc. and make it possible to evidence the existence of a document's contents without actually publishing it... or show a summary proving that the document remains unchanged.
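The idea of identifying objects by hash sums can be sketched in a few lines of Ruby using the standard Digest library. This is only an illustration under stated assumptions: the layout of a real Document Content Certificate is not described here, so the certificate method below simply lists one digest per object plus a summary digest over that list, and the names digest_class, object_digests and certificate are hypothetical.

```ruby
require 'digest'

# Pick the digest implementation: MD5 by default, SHA-256 optionally.
def digest_class(algorithm)
  algorithm == :sha256 ? Digest::SHA256 : Digest::MD5
end

# One digest per text object, keyed by object number (starting at 1).
def object_digests(objects, algorithm: :md5)
  objects.each_with_index.map do |text, i|
    { ocn: i + 1, digest: digest_class(algorithm).hexdigest(text) }
  end
end

# Assumed certificate layout: the per-object digests plus one summary digest
# computed over their concatenation.
def certificate(objects, algorithm: :md5)
  entries = object_digests(objects, algorithm: algorithm)
  summary = digest_class(algorithm).hexdigest(entries.map { |e| e[:digest] }.join)
  { algorithm: algorithm, objects: entries, summary: summary }
end

original = ['A heading', 'First paragraph.', 'Second paragraph.']
revised  = ['A heading', 'First paragraph, edited.', 'Second paragraph.']

before = certificate(original)
after  = certificate(revised)

# Compare object by object to see which ones differ.
changed = before[:objects].zip(after[:objects])
                          .reject { |a, b| a[:digest] == b[:digest] }
                          .map { |a, _| a[:ocn] }

puts "changed objects: #{changed.inspect}"
puts "document unchanged? #{before[:summary] == after[:summary]}"
```

A certificate of this kind can be published without the text itself: anyone later given the document can recompute the digests and compare them, either object by object or through the summary alone.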
In March 2005, with internationalisation in mind, character representation for source documents was switched over to Unicode UTF-8 ... and as a result output is readily available across most languages in html, XML and SQL database representation (PostgreSQL and SQLite), tested to be OK even for Chinese... LaTeX / PDF and ODF output work across several European languages, but need further implementation work for other languages that are not yet covered.
Open Document Format output was first introduced to a SiSU release late in 2005 (October).
Manifests that summarise the generated output made available were also introduced late in 2005, as were zipped versions of SiSU markup containing all related documents and images (sisupod.zip). The latter are a bit interesting as they gather the constituent parts of a document, which include the source document and any images (and, in the case of multilingual documents, possibly multiple language versions of the source document), in a single zipped file, which can be emailed, and from which outputs can also be generated.
In 2006 I got to visit Oaxtepec, Mexico for Debconf6.
Alternative XML representations for SiSU markup were introduced in 2006 shortly after Subtech... they provide 3 forms of XML (SAX, DOM and a Node based tree, that can be converted to and from SiSU markup). These work, though they are largely proof of concept and require further work, especially as regards what the XML should most conveniently be.
Since the release of SiSU, code and features have continued to evolve gently... Over the years many "requirements" have been requested and incorporated, too many to make mention of here, including amongst them things from "canned search" in the sample cgi search forms to fairly complex footnote alternatives, and alternative XML representations of the input text. Since 2005 (SiSU becoming Software Libre), most of these have been mentioned in the changelog, and a few others may be evident from the Chronology pages dating back to 1993.
Wookey has been a Debian mentor (he introduced me to Debian packaging, and did the uploads subsequent to the initial upload of SiSU). In recent times the greatest indirect support (i.e. not coding/programming or developing SiSU directly, which has to date been a solo effort for around 10 years) has come from the young Daniel Baumann, who is amazing in providing feedback, especially in relation to packaging and things technical in Debian, and who has been extremely generous with his time and expertise.
It was not until March 2007 that a sample search database was put online, which can be found at ‹http://search.sisudoc.org›.
A rule of thumb for SiSU remains that what it does - the idea, and what it means can be done is more beautiful than the code, which is again a lot more beautiful than these descriptive pages... for which there has been little time and attention, but which indeed I return to and have plans to work on.
263. which explored the potential of the web, starting in 1993, for sharing international treaties and conventions related to international trade and commerce (primarily related to private international law) that had been published by various institutions with the goal of harmonising law in the field
264. this I am pretty satisfied with although there could be alternative preparation syntaxes, and indeed 3 forms of XML are recognised though they are transformed to basic markup for processing
265. there are a number of ways I could think of further developing this, in particular the current model gives object numbers to substantive content, there should be an alternative system for any non-substantive content; the current model provides secondary identification to certain types of text block, e.g. headings and paragraphs, this should be extended to identify all types utilised by the system
266. The name SiSU was not yet used for the software under development at the time, and indeed a name was not particularly needed as it was not shared, but this was an essential feature of what came to be named SiSU. SiSU evolved out of a need to address some of these issues and having come upon a conceptual solution to address several of them.
267. it took a while to realise just how superior this environment was for development (or indeed generally), well that it was an improvement was immediately evident but the realisation of just how much, that came with experience.
268. at the time of the switch Active State Perl, or whatever it was called, Perl on Windows (NT) appeared to have memory leaks, I had to reboot Windows NT several times each day to free memory, which was painful. This problem vanished with the switch.
269. I was having problems managing my Perl code, no doubt my own fault, but too many parts of the code at the time were dependent on each other in ways that were difficult to keep track of, so it became increasingly risky to make changes. Ruby I had identified as being of interest early in the year in a flamefest between Perl and Python coders. I had no interest in Python; Ruby, on the other hand, immediately sounded like being of potential interest... and then I had read and enjoyed the Pragmatic Programmer, written by the same Dave Thomas who, it turned out, was writing the first English language book on Ruby... One thing Ruby did immediately (though my initial Ruby bore more resemblance to Perl with a little less noise than to typical Ruby code) was that the program became more modular. It was easier to manage change and to locate and repair any broken code dependencies. The code model subsequently evolved, the Ruby too, though more slowly.
270. to which I got myself introduced through a developer named Wookey, met at a Linux Show at the London Olympia (unable to wait any longer for Debian Woody, for which I had been waiting to make the move for what seemed a very long time).
271. years after it was conceived as being an interesting/useful representational form to develop
272. and with very little editing, there is so little time
273. This was an attempt at the time to tread the line between telling what had been done and not sharing/publishing it before I was ready. The reason this was necessary was that this was the first form of output that I was unable to provide immediate direct evidence of having achieved, not having a suitable relational database available to me at the hosting site.
274. The marketing man I met, who had no interest in looking at the software, came up with incongruous statements about IBM seeing lots of fantastic technology, and buying lots that they never released... several other remarks were no more reassuring. But I got some useful terminology from him. One thing he appeared impressed by, but seemed perhaps not to like, was the fact that SiSU appeared to have built into it a fairly high degree of "future proofing".
275. Which I serendipitously attended, being interested and happening to take my summer holiday a few hours away that year.
SiSU Book Samples and Markup Examples
The Wealth of Networks - How Social Production Transforms Markets and Freedom
Yochai Benkler
2006
Free Culture - How Big Media Uses Technology and the Law to Lock Down Culture and Control Creativity
Lawrence Lessig
2004
CONTENT - Selected Essays on Technology, Creativity, Copyright and the Future of the Future
Cory Doctorow
2008
Free As In Freedom - Richard Stallman's Crusade for Free Software
Sam Williams
2002
Two Bits - The Cultural Significance of Free Software
Christopher Kelty
2008
The Cathedral & the Bazaar - Musings on Linux and Open Source by an Accidental Revolutionary
Eric S. Raymond
1999
Free For All - How Linux and the Free Software Movement Undercut the High Tech Titans
Peter Wayner
2002
Cory Doctorow
2008
Free Software Foundation - FSF
GPL - GNU General Public License