Comparison of Two Approaches of Managing Links in Multiple Versions of Documents


Chuen Y. Lo

Department of Computer Science
Royal Melbourne Institute of Technology
124 La Trobe Street, Melbourne, Victoria 3000
Australia
cyl@kbs.citri.edu.au

1) Introduction

Links and versioning are two important aspects of document management. Links represent inherent associations of content and structure of texts. Efficient management of links allows convenient cross referencing in information browsing. Versioning is essential to history-keeping of a document. It allows evolutionary information and states of this document be captured so that future references are possible. An integrated strategy to implement both areas is thus important to efficiently maintain links among multiple versions of multiple documents.

Nevertheless, the practice of versioning links does not seem to be widely addressed, though there is much work dedicated to other versioning aspects such as temporal databases [8], temporal indexes [2] and temporal data management [7]. Therefore in this paper two approaches for versioning links are presented. Based upon a specific versioning mechanism the performances of both approaches are compared.

2) The Context

The two methodologies presented are based on the context of SGML environment. SGML (Standard Generalised Markup Language) was adopted by ISO [6] as an international standard to describe the structure of electronic documents. The reason of using SGML is its international acceptance as an electronic document markup standard. Furthermore, while the description of a document's structure is primarily applied in publication, database technology could also make use of this structural knowledge to enhance its management of documents [1]. In particular, SGML structures can be utilised to implement links [3,9].

Link sources and destinations can be defined by SGML tags. SGML placed tags in text to denote its structure and such practice is known as descriptive markup. A component in the text such as a title or a paragraph can be explicitly defined by marking them with these tags. Thus, a location in the text which is capable of acting as a link source or a link destination can be similarly marked with tags. The presented methodologies refer to links and destinations marked-up in this fashion.

Figure 1 shows an example used to illustrate the concepts presented throughout the paper. A and B are two separate documents, and each has three versions, created as shown in the time line. a, c and e are sources of links, and b, d, f and g are destinations. Three types of links are identified: intra-version link are links with both the source and destination located in the same version, eg the link ab; inter-version links are links with both the source and destination located in the same document but different versions, eg the link ef; and inter-document links are links with both the source and destination located in different documents, eg the link cd.

Each destination is given an identifier unique to the version. For example, d and g in both [B 1] (Document B Version 1) and [B 2]. Globally unique identifications are then derived by combining the document identifier (DocId) and the version number (VerNo) with the destination identifier (DestId), for example, [B 1 d].

Destinations that share the same identifier within the same document but different versions, for instance, [B 1 d] and [B 2 d], represent the same textual location throughout those versions. The identifier of any deleted destination will not be re-used in later versions; and newly added destinations will be given incremented identifiers.

To limit the scope of the problem, it is decided that the creation of links in versioning is not addressed in this paper. Also, link sources refer to explicitly defined destinations only. Although a section, a paragraph, or even a document in itself can be a destination, the referencing of these components requires certain location addressing schemes such as those provided by the HyTime standard [5]; and thus is not considered in this paper. In addition, it is determined that dangling references are not allowed, and it is an error for a source to refer to non-existing destinations.

3) The Model of Links and Versions

There are many possibilities how links could be versioned. One reasonable way is as shown in Figure 1. The three types of links are considered in this model as follows. An intra-version link is inherited in the new version if both the source and the destination of that link are inherited intact, for example, link ab in both [A 1] and [A 2].

An inter-version link is always initiated from a newer version to an older version, for example, link ef. The prohibition of dangling references prevents a link created from a current version to a future version; and common versioning practice prevents modifying a committed version so it is not possible to create a link source in the content of the older version to refer to any destination. This practice also confines the locations to which an inter-version link can point to because the destinations must have already been defined and exist in the older version for a newer version to refer to them.

An inter-document link, in this model, always points to the destination in the possible latest version. It treats destinations pertaining to the document, not the version of the document. Consider the example: at time 1 and time 2 [B 1] and [A 1] were created respectively. [A 1] had one intra-version link ab and one inter-document link cd. At time 3 [B 2] was created. The link versioning mechanism works this way: the source [A 1 c] was changed to refer to the destination [B 2 d] and at the same time the old pointer to [B 1 d] was discarded. Thus in this way the source remains pointing to the destination in the possible latest version, though not necessarily the current version.

4) Direct Reference Method

This method specifies unique identifiers (Id) for referent elements (destinations); and directional links from the reference elements (sources) can be established by making references (Ref) to the formers' identifiers. This is basically the ID-IDREF referencing method of SGML. The difference is that SGML ID-IDREF method can only address Refs within the same piece of text. This is adequate for intra-version links only. The Ids used by the Direct Reference Method must thus be augmented by either DocId or VerNo to identify the destination.

A destination address table would certainly be useful to this method. In this destination address table, the offset of all the link destinations in all the documents are recorded. Each time a source is triggered, the table is looked up to locate the corresponding destinations for traversal. Whenever a new version is created, the new piece of text is scanned and the offsets of all the destinations within the text are appended (for new destinations) or changed (due to the change in position). The old information is left there unchanged to support traversal to older versions of documents.

Additionally an inverted file is required in which the sources that refer to each destination in the whole document collection are listed. For example, correspond to Figure 1, [B 1 d] would be recorded as being referred by [A 1 c] at time 2. The purpose of this inverted file is to backtrack the references that point to a given destination. When a new version, say, [B 2] is created, it is necessary to know that [B 1 d] is referred by [A 1 c] and hence [A 1] should be retrieved and has its c updated to refer to the new [B 2 d]. Without this inverted file, there is no way to update the sources in the referring documents.

5) Link-Objects Method

This method set up a global link table and maintain links as objects, like the concept of the Dexter Hypertext Model [4]. The global link table stores the information of all the links available in the document base. Each entry in the global link table consists of a globally unique identifier for the link (LinkId), and the DocId, VerNo, DestId, and Offset of the destination. Sources in documents thus make references to the LinkIds instead of the DestIds to refer to the destinations. This can be viewed as an indirect reference method. New LinkId is derived by incrementing the currently largest LinkId.

Consider the example in Figure 2, Link #1 was created for link ab when [A 1] was created, and so was Link #2, pointing from [A 1 c] to [B 1 d]. When [B 2] was created at time 3, Link #2 was updated to point to [B 2 d] and the pointer to [B 1 d] was discarded.

As [A 2] was created, the link ab was inherited and so another link object was created to denote this new inherited link (Link #3). The inheritance of the Ref (pointing to Link #2) at [A 2 c] automatically make it point to [B 2 d] so no special update was required. At the same time, Link #4 was created to denote link ef. At time 5 [A 3] was created, and link ef was inherited. Unlike intra-version links, inter-version links refer to absolute destination in a specific version, and so the inheritance of the Ref (pointing to Link #4) at [A 3 e] automatically make it point to [A 1 f] and hence no special update was required. As the Ref at [A 3 c] was changed to [B 2 g], a new link object was created to denote this new link. The old Link #2 that points to [B 2 d] was left unchanged to allow traversal from an older version to a "historical" destination.

6) Comparison

The Direct Reference Method works well if versioning is not taken into consideration. In the aforementioned versioning model, when a new version of VerNo n, Vn, is created, documents that have sources refer to its previous version Vn-1 need to be updated to refer to the new Vn instead. For example, [A 1 c] was updated to refer to [B 2 d] instead of [B 1 d] when [B 2] was created. This modification involves locating the sources in the inverted file, searching for the corresponding documents, and updating the link information in those documents. Such an inconvenience is intensified when there are huge amount documents involved.

Another inconvenience is the maintenance of the inverted file. Each time a version is created, a full-text scan is required so that all Ids referenced in the new version will have their entries updated in the file. Also, this inverted file can cause large amount of space overheads.

The main advantage that the Link-Objects Method has over the Direct Reference Method is that there is no need to access the referring documents to update their sources. All updates can be done to the single global link table. For example, when [B 2] was created, it was only required to change the reference in Link #2. All documents that refer to Link #2 for destination d would have their references automatically changed to [B 2 d] without other additional updates. This has the benefits of keeping the updating operation centralised to one single target. Further, the number of necessary update operations will also be smaller since multiple sources which address to the same link identifier require only one single update operation.

Nevertheless, this approach suffers from one major drawback: each time a link traversal is requested this global link table has to be accessed to locate the destinations of the traversal. Obviously the performance must be fast enough so as not to make the approach impractical.

Both methods introduce space overheads due to the external tables employed. As a theoretical estimation, the global link table of the Link-Objects Method should consume less space than the inverted file of the Direct Reference Method. Depending on the amount of sources and their number of references the size of the inverted file can vary a lot. The more sources refer to a particular destination, the larger the inverted file. On the other hand, the global link table does not have this problem since a link can be re-used (eg Link #2) by more than one source. Instead of listing the identification for all sources referring to a destination, one LinkId would suffice and thus save a lot of space.

7) Conclusion

The Link-Objects Method of maintaining links as objects is apparently an intended alternative to tackle the updating difficulty encountered in the Direct Reference Method. Certainly the major disadvantage of frequent global link table lookup for link traversal requests makes an efficient index on this table extremely important. It is nevertheless believed that this approach has a good potential to perform better than the Direct Reference Method.

There are other issues such as the addition of links in a newer version and references to components within the text that are not marked up as destinations, have not been addressed in this paper. It is also important to tackle these problems in order to present a more detailed link versioning mechanism.

Currently a prototype that handles links and versioning of documents is being implemented on Structured Information Manager (SIM), a document management database developed by Collaborative Information Technology Research Institute (CITRI), Melbourne. Further investigation will be conducted to explore the issues of versioning links in document database.

Acknowledgements

Many thanks to Dr Ross Wilkinson and Dr Alan Kent for their invaluable help and advice.

References

[1] P. W. Chan, A. Kent, and R. Wilkinson. Managing Software Engineering Documents. In Proceedings of the Sixteenth Australian Computer Science Conference, Brisbane, Australia, 1993.

[2] R. Elmasri, G. T. J. Wuu, and Y. J. Kim. The Time Index: An Access Structure for Temporal Data. In Proceedings of the 16th Very Large Database Conference. Brisbane, Australia, 1990.

[3] E. Fahmy, and D. T. Barnard. Adding Hypertext Links to an Archive of Documents. The Canadian Journal of Information Science, 15(3):25-41, September 1990.

[4] F. Halasz and M. Schwartz. The Dexter Hypertext. Communications of the ACM, 37(2):30-39, February, 1994.

[5] International Organization for Standardization and International Electrotechnical Commission. Information Technology - Hypermedia/Time-based Structuring Language (HyTime), International Standard ISO/IEC DIS 10744:1992(E), 1992.

[6] International Organization for Standardization. Information Processing --- Text and Office Systems --- Standard Generalised Markup Language (SGML), International Standard ISO/IEC DIS 8879:1986, 1986.

[7] A. Shoshani and K. Kawagoe. Temporal Data Management. In Proceedings of the 12th International Conference on Very Large Data Bases, Kyoto, 1986.

[8] R. Snodgrass and I. Ahn. A Taxonomy of Time in Databases. In Proceedings of ACM SIGMOD 1985 International Conference on Management of Data, Austin, 1985. ACM SIGMOD.

[9] E. Wilson. Converting an SGML Text to Hypertext. Technical Report TEI TR3 W6 from University of Kent at Canterbury.