Multilingual Document Search Sample (the old Live Manual from debian-live, the original branch that became part of open-infrastructure)

Available Languages (some partially translated)

Note: these are markup samples only and not up to date documentation for Live-Manual.

Sample Academic Legal Writings

(a sample of roughly 400 scholarly writings on the CISG prepared for Prof. Albert Kritzer at the CISG Database, Institute of International Commercial Law, Pace University in 2007 (which currently hosts over 1500 such writings))

SiSU, a description

SiSU for documents - structuring, publishing in multiple formats and search.

SiSU is a lightweight markup based, command line oriented, document structuring, publishing and search, static content tool for document collections.

With minimal preparation of a plain-text (UTF-8) file, using SiSU markup syntax in your text editor of choice, SiSU can generate various document formats, most of which share a common object numbering system for locating content, including plain text, HTML, XHTML, XML, EPUB, OpenDocument text (ODF:ODT), LaTeX and PDF files, and can populate an SQL database with objects (roughly paragraph-sized chunks) so that searches may be performed and matches returned with that degree of granularity. Think of being able to finely match text in documents, using common object numbers, across different output formats and across languages if you have translations of the same document. For search, the result tells you that your criteria are met by these documents at these locations within each document (equally relevant across different output formats and languages). To be clear (if obvious), page numbers provide none of this functionality. Object numbering is particularly suitable for "published" works (finalized texts as opposed to works that are frequently changed or updated), for which it provides a fixed means of referring to content. Document outputs can also share provided semantic meta-data.

SiSU also provides concordance files, document content certificates and manifests of generated output and the means to make book indexes that make use of its object numbering.

Syntax highlighting and folding (outlining) files are provided for the Vim and Emacs editors.

Dependencies for various features are taken care of in sisu related packages. The package sisu-complete installs the whole of SiSU.

Additional document markup samples are provided in the package sisu-markup-samples, which is found in the non-free archive. The license for the substantive content of each marked-up document is that provided by its author or original publisher.

SiSU uses utf-8 & parses left to right. Currently supported languages: am bg bn br ca cs cy da de el en eo es et eu fi fr ga gl he hi hr hy ia is it ja ko la lo lt lv ml mr nl nn no oc pl pt pt_BR ro ru sa se sk sl sq sr sv ta te th tk tr uk ur us vi zh (see XeTeX polyglossia & cjk)

SiSU works well under po4a translation management, for which an administrative sample Rakefile is provided with sisu_manual under markup-samples.

take two

SiSU may be regarded as an open access document publishing platform, applicable to a modest but substantial domain of documents (typically law and literature, but also some forms of technical writing), that is tasked to address certain challenges I identified as being of interest to me over the years in open publishing.

The idea and implementation may be of interest, as some of the issues encountered, and that SiSU seeks to address, are known and common to such endeavors. Amongst them:

* how do you ensure what you do now can be read in decades?

* how do you keep up with new and changing technologies?

* do you select a canonical format to represent your documents, if so what?

* how do you reliably cite (locate) material in different document representations?

* how do you deal with multilingual texts?

* what of search?

* how are documents contributed to the collection?

(These questions are selected to help describe the direction of efforts with regard to SiSU.)

My Dabblings in the Domain of Open Publishing
---------------------------------------------

The system is called SiSU. It is an offshoot of my early efforts at finding out what to make of the web, which started at the University of Tromsø in 1993 (an early law website: Ananse / International Trade Law Project / Lex Mercatoria). I have worked on SiSU continually since 1997 and it has been open source since 2005 (under a license called GPL3+), though I remain its developer.

In working in this field I have had to address some of the common issues.

So how do you ensure what you do now can be read in decades to come? There are alternative solutions: (i) stick with a widely used, not overly complicated and well-documented open standard, for which the likes of ODF is an excellent choice; (ii) alternatively, go for the most basic representation of a document that meets your needs, in my case based on UTF-8 text and some markup tags, fairly easily parsable by the human eye; as long as UTF-8 is in use it will always be possible to extract the information.

How do you keep up with new and changing technologies? Here my solution has been to generate new versions of the substantive content so as to always have the latest document representations available. For example, HTML has changed a lot over the years, different specifications come out for various formats including ODF, and electronic readers have become an important viewing alternative, introducing the open reader format EPUB. Output representations are generated from source documents. Different open document file formats can be produced, and databases and search engines populated. (The source documents and interpreter are all that are required to re-create site content. Source documents can be made public or retained privately.) The strict separation of a simple source document from the output produced means that with updates to SiSU (the interpreter/processor/generator), outputs can be technically updated as necessary, and new output formats added when needed. Amongst the output formats currently supported are HTML, LaTeX-generated PDFs (A4, letter, other; landscape, portrait), EPUB and Open Document Format text. Returning to HTML as an example: it has changed a lot over the years I have worked with it, and this way of working has meant it is possible to keep producing current versions of HTML while retaining the original substantive document; new formats have been added as thought desirable.

There is no attempt to make output in different document formats/representations look alike, let alone identical. Rather, the attempt is to optimize output for the particular document filetype (there is no reason why an EPUB document should look or behave like an Open Document text, or why a PDF should look like HTML output; PDF is optimized for paper viewing, HTML for screen, etc.). Wherever possible, features associated with the particular output type are taken advantage of. This freedom is made possible to a large extent by the answer to the question that follows.

How do you reliably cite (locate) material in different document representations? The traditional answer has been to have a canonical publication, and resulting fixed page numbers. This was not a viable solution for HTML (whose rendering changes from one viewer to another and with selectable font faces, sizes, etc.); nor is it otherwise ideal in an electronic age with the possibility of presenting and interacting with documents in so many different ways. Why be so restricted? Here my solution has been "object citation numbering". What the various generated document formats have in common is a shared object numbering system that identifies the location of text and that is available for citation purposes. Object numbers are sequential numbers assigned to each identified object in a document. Objects are logical units of text (or equivalent parts of a document): usually paragraphs, but also document headings, tables, images, a verse in a poem, etc. [In an electronic publishing age, are page numbers the best we can come up with? Change font type, font size, page orientation, paper size (sometimes even the viewer) and where are you with them? And paper, though a favorite medium of mine, is no longer the sole (or sometimes primary) means of interacting with documents/text or of sharing knowledge.]
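
As a rough illustration of the idea (not SiSU's actual parser), sequential object numbers can be assigned by treating each blank-line separated block of a plain-text source as one object. The names `DocObject` and `number_objects` are hypothetical, and real SiSU markup distinguishes headings, tables, images and verses in ways this sketch ignores.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class DocObject:
    ocn: int   # object citation number, sequential within one document
    text: str  # the object's text (a heading, a paragraph, a verse, ...)


def number_objects(source: str) -> list[DocObject]:
    """Split plain text on blank lines and number the blocks from 1.

    Assumption for this sketch: one blank-line separated block is one object.
    """
    blocks = re.split(r"\n\s*\n", source.strip())
    # Collapse line wraps inside a block so re-wrapping never changes an object.
    cleaned = [" ".join(block.split()) for block in blocks]
    non_empty = (text for text in cleaned if text)
    return [DocObject(ocn, text) for ocn, text in enumerate(non_empty, start=1)]


sample = """A Heading

First paragraph of the text,
wrapped over two lines.

Second paragraph."""

objects = number_objects(sample)
for obj in objects:
    print(obj.ocn, obj.text)
```

Because the number depends only on the order of objects in the source, it is unaffected by font, page size or viewer, which is the property page numbers lack.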

What object numbers mean (unlike page numbers) is e.g.

* if you cite text in any format, the cited text can be reliably located in any other document format. Cite the HTML and the reader can choose to view the EPUB or PDF instead (the PDFs being an independent output, generated by the book publishing software XeTeX/LaTeX).

* if you do a search, you can be given a result "index" indicating that your search criteria are met by these documents, and at these specific locations within each document, and the "index" is relevant not only for content within the database, but for all document formats.

* if you have a translated text prepared for sisu, then your citations are relevant across languages e.g. you can specify exactly where in a Chinese document text is to be found.

* generated document index references & concordance list references etc. are relevant across all output formats.
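
The first point above can be sketched as follows: two independent renderers consume the same numbered objects, so a citation of object 2 points at the same text in either output. The anchor form `id="o2"` and the bracketed suffix `[2]` are assumptions made for this illustration, not a description of what SiSU actually emits.

```python
from html import escape

# Numbered objects as (object number, text) pairs; the sample text is made up.
objects = [
    (1, "A Heading"),
    (2, "Cats & dogs are discussed here."),
    (3, "A closing paragraph."),
]


def to_html(objects):
    """Render each object as a paragraph whose id carries its object number."""
    return "\n".join(
        f'<p id="o{ocn}">{escape(text)}</p>' for ocn, text in objects
    )


def to_plain_text(objects):
    """Render each object as a text block followed by its number in brackets."""
    return "\n\n".join(f"{text} [{ocn}]" for ocn, text in objects)


html_output = to_html(objects)
text_output = to_plain_text(objects)
print(html_output)
print(text_output)
```

The two outputs look nothing alike, yet "object 2" identifies the same paragraph in both, which is what frees each format to be laid out in the way that suits it best.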

What of search? For search, see the implications of object numbers mentioned above. The system currently loads an SQL server (PostgreSQL) with object-sized text chunks. It could just as well populate an analytical engine (such as the currently popular Elasticsearch) with larger sections or chapters of text for analytical purposes, whilst also availing itself of the concept of objects and object numbers in search results.
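
A minimal sketch of object-level search, assuming a single table with one row per object. It uses Python's built-in `sqlite3` with an in-memory database to stay self-contained; the table name, column names, document names and sample sentences are all invented for the illustration and do not reflect SiSU's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE objects (
        doc  TEXT    NOT NULL,  -- document identifier (hypothetical)
        ocn  INTEGER NOT NULL,  -- object number within the document
        body TEXT    NOT NULL,  -- text of the object
        PRIMARY KEY (doc, ocn)
    )
    """
)

rows = [
    ("doc_a", 1, "Formation of agreements"),
    ("doc_a", 2, "A contract is formed when an offer is accepted."),
    ("doc_a", 3, "The seller delivers the goods."),
    ("doc_b", 1, "Remedies for breach of contract"),
    ("doc_b", 2, "The buyer may claim damages."),
]
with conn:
    conn.executemany("INSERT INTO objects VALUES (?, ?, ?)", rows)


def search(conn, term):
    """Return (document, object number) pairs whose text contains the term."""
    cursor = conn.execute(
        "SELECT doc, ocn FROM objects WHERE body LIKE ? ORDER BY doc, ocn",
        (f"%{term}%",),
    )
    return cursor.fetchall()


hits = search(conn, "contract")
print(hits)  # [('doc_a', 2), ('doc_b', 1)]
```

The result is the "index" described earlier: a list of documents and the object numbers at which the criteria are met, usable against any output format of those documents.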

How do you deal with multilingual texts? If you have translated text prepared for SiSU, then your citations are relevant across languages. Object numbers also provide an easy way to compare and discuss text (translations) across languages. Text found or cited in one language has the same object number in its translations: a given paragraph will have the same number in another language, just change the language code. (Documents are prepared in UTF-8; current language restrictions come from the use of the LaTeX tools polyglossia and CJK (Chinese, Japanese & Korean), and from the fact that SiSU parses left to right.)
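
The cross-language lookup can be sketched as below, assuming each translation keeps a one-to-one alignment of objects with the original, as described above. The function names and the sample sentences are hypothetical.

```python
# One list of objects per language code; index 0 holds object number 1.
translations = {
    "en": ["The Title", "Good morning.", "Thank you."],
    "fr": ["Le titre", "Bonjour.", "Merci."],
}


def find_object_number(translations, lang, phrase):
    """Return the number of the first object in `lang` containing `phrase`."""
    for ocn, text in enumerate(translations[lang], start=1):
        if phrase in text:
            return ocn
    return None


def object_text(translations, lang, ocn):
    """Return the text of object `ocn` in the given language."""
    return translations[lang][ocn - 1]


# Locate a passage in English, then read the same object in French.
ocn = find_object_number(translations, "en", "morning")
print(ocn, object_text(translations, "fr", ocn))
```

Only the language code changes between the two lookups; the object number found in one language is reused unchanged in the other.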

How are materials prepared for contribution to the collection? (a) The easiest solution, if the system allows, is submission in the format in which the work is authored, usually that of a word processor, for which ODF may be a decent selection. (b) I have stuck with enhanced plain text: UTF-8 with minimal markup. Source documents are prepared in UTF-8 text, with a minimalist native markup to indicate the document structure (headings and their relative levels), footnotes, and other document "features". This markup is easily parsable by the human eye, and plays well with version control systems. Documents are prepared in a text editor. Front ends, such as markup assistants in a word processor that can save to SiSU text format, or other tools, whilst possible, do not exist. [(c) Yet another form of submission for collaborative work is the wiki, which has shown its strength in efforts such as Wikipedia.]

The system has proven to be a good testing ground for ideas and is flexible and extensible. (Things that could usefully be done, apart from a front end for simpler user interaction: feed text to an analytical search engine, like Elasticsearch/Lucene; add a bibliography parser, which it still needs (auto-generation of a bibliography from footnotes); and possibly allow rough auto-translation of documents on the fly by passing text through a translator (such as Google Translate).)

In any event, my resulting technical opinions (in my modest domain of action) may be regarded as encapsulated within SiSU.

git clone git://;a=summary
(there are additional commits in the upstream branch)
git clone git://

Development work is on Linux, and the easiest way to install SiSU is through the Debian Linux package, as this takes care of optional external dependencies such as XeTeX for PDF output and PostgreSQL or SQLite for search.

