
Chapter 9

Understandability and Usability of Data

I hear and I forget. I see and I remember. I do and I understand. (Confucius)

Ensuring that digitally encoded information remains usable and understandable over time is, together with authenticity, at the heart of digital preservation. The previous chapter discussed some of the formal aspects of intelligibility. This chapter discusses the complementary issue of usability of the data.

Usable means "capable of use" (OED) or "available or convenient for use" (www.dictionary.com). In design, usability is the study of the ease with which people can employ a particular tool or other human-made object in order to achieve a particular goal. In human-computer interaction and computer science, usability studies the elegance and clarity with which the interaction with a computer program or a web site is designed (Wikipedia). Here, by usable we mean that someone is able to do something sensible with the information that a digital object contains. We recognise that this might not be easy, but at least it should be possible.

One could of course use a digital object simply by printing out its constituent sequences of 1s and 0s on paper and using this to decorate one's home. However, it seems reasonable to suppose that this has little to do with the information content in the digital object, unless of course that is what it was designed for.

For example, the Arecibo message [130] was designed to be understood by extraterrestrials. It consisted of a sequence of 1,679 bits which, if displayed as 73 rows by 23 columns, looks like Fig. 9.1 (the shading has been added on the right to make the different parts of the image clearer). The idea is that, even with no shared cultural or linguistic roots, one can rely on basic counting, an awareness of prime numbers, elements, chemistry and physics, which any being able to receive the message might reasonably be expected to possess. It is not clear how many human recipients could decipher the message without help!
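The arithmetic behind the layout can be checked directly: 1,679 is the product of the two primes 23 and 73, so the only non-trivial rectangular arrangements are 73 by 23 and 23 by 73. The following Java sketch is illustrative only. The class and method names are hypothetical, and the bit string used in `main` is a placeholder, not the actual Arecibo message.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch: class and method names are hypothetical. */
final class BitGrid {

    /** Returns the prime factors of n in ascending order (trial division). */
    static List<Integer> primeFactors(int n) {
        List<Integer> factors = new ArrayList<>();
        for (int p = 2; (long) p * p <= n; p++) {
            while (n % p == 0) {
                factors.add(p);
                n /= p;
            }
        }
        if (n > 1) {
            factors.add(n); // whatever remains is itself prime
        }
        return factors;
    }

    /** Splits a string of '0'/'1' characters into rows of the given width. */
    static List<String> toRows(String bits, int columns) {
        if (columns <= 0 || bits.length() % columns != 0) {
            throw new IllegalArgumentException("length is not a multiple of the column count");
        }
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < bits.length(); i += columns) {
            // Render 1s as '#' and 0s as '.', one row per list entry.
            rows.add(bits.substring(i, i + columns).replace('1', '#').replace('0', '.'));
        }
        return rows;
    }

    public static void main(String[] args) {
        // The two factors give the candidate grid dimensions.
        System.out.println(primeFactors(1679));
        // Placeholder data only: NOT the real message.
        String placeholder = "101010101010";
        toRows(placeholder, 4).forEach(System.out::println);
    }
}
```

A recipient who tries both arrangements of the real 1,679 bits would see that only one of them produces a recognisable picture, which is the point of choosing a length with exactly two prime factors.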

D. Giaretta, Advanced Digital Preservation, DOI 10.1007/978-3-642-16809-3_9, © Springer-Verlag Berlin Heidelberg 2011


Fig. 9.1 Arecibo message as 1s and 0s (left) and as pixels both black and white (centre) and with shading added (right)

9.1 Re-Use of Digital Objects: Interoperability and Preservation


One of the interesting, and indeed useful, benefits of following OAIS and judging digital preservation in terms of usability and understandability is that resources which are needed for preservation also produce immediate benefits in terms of wider, contemporary use of the digital objects. We justify this claim by noting that if one is familiar with a particular piece of digitally encoded information then, apart from keeping the bits, one needs nothing else. Representation Information beyond that held in one's mind is needed only where information is unfamiliar in some sense. This unfamiliarity can arise from the passage of time, in which case we are in the realm of digital preservation. Alternatively, unfamiliarity can arise from distance in discipline or experience, which can apply no matter what the difference in time; in that case the Representation Information is necessary for usability by a wider community. This is a very important consideration which should help to justify the expenditure of those resources in preservation.

9.1.1 Relationship Between Preservation and (Re-)Use


Preservation of digitally encoded information requires that it continues to be usable and understandable by a Designated Community. This has been extensively discussed in the previous chapters. A Designated Community is defined by the repository (see Sect. 6.2), and this definition is vital for the testability of the effectiveness of the preservation activities of the archive. However, the point to realise is that the Representation Information Network can (perhaps easily) be extended to that needed by another Designated Community or, perhaps more precisely, to match the Knowledge Base of some other user community, for immediate use. In other words, although the digitally encoded information is not guaranteed by the repository to remain usable by these other users, making explicit the Representation Information required to fill the knowledge gap makes this much more likely to be the case. Moreover, the types of Registry/Repository(ies) of Representation Information which are described in this book will make it much easier to share the Representation Information required. The repository holding the data does not itself have to fill the gap; it needs to make clear the end points of the Representation Information Network that it can provide.

This is not to say that everything becomes trivial. It is instructive to look at a number of possibilities. One can first consider a single data object, which may of course consist of several bit sequences (for example several files). After this the implications for combining digitally encoded information may be analysed.

9.1.2 Digital Object Used By Itself


A digital object may be used by itself; for example, a user may simply want to find a particular fact from a dataset. For the sake of concreteness let us say that (s)he wants to determine the photon counts at a certain position in the sky from data captured by a particular astronomical instrument, and that the data is held in a FITS file. Other examples could include determining the character or the font used at a particular position, say the 25th character of the second paragraph of page 51, in a document. These are in many ways the simplest pieces of information which one might wish to extract from a digital object. However, if one can do this then one can build up to the extraction of more complex pieces of information, using the concepts of virtualisation discussed in Sect. 7.8.

The Representation Information Network (RIN) (Fig. 9.2, an annotated version of Fig. 6.4) indicates that a Java application is available to extract the numbers from the data. Of course this RIN will also let us know which version of Java is needed and so forth. If the user can run the Java application then it is a simple matter to extract the number.

Fig. 9.2 Using the representation information network in the extraction of information from digitally encoded information (FITS file)
[Diagram nodes: FITS file, FITS standard, FITS dictionary, DDL description, FITS Java software, PDF standard, dictionary specification, DDL definition, DDL software, Java VM, PDF software, XML specification, Unicode specification. The annotations mark the alternative routes for extracting the numbers.]

Other options include:

A. If (s)he does not have the correct version of Java at hand then (s)he at least has the option of trying to obtain it from another Registry/Repository, because (s)he knows what is needed.
   a. An important variant of this is the use of emulators, described in Sect. 7.9.
B. If the Java application cannot be run then it might be possible to take the Java source code, if available, and convert it to some other programming language, say C, from which one can create an appropriate application.
C. If neither (A) nor (B) is possible, then a data description language (DDL) such as EAST or DRB, together with the associated data dictionary, may be used. Again there are a number of possibilities.
   a. The easiest is that a generic application such as the one described in Sect. 7.3.5 can use the data description to extract the information needed.
   b. Otherwise one might have to read the DDL description, together with the definition of that DDL and the associated Data Dictionary or other piece of Semantic Representation Information, and then write an appropriate application. This would no doubt be harder, but at least one would not have to guess at what information the digital object holds.

Some of these options are trivial, which would be very convenient for the user. However, if a trivial option is not available then at least the other options are possible: the information can be extracted with considerable certainty and used for other purposes.
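The order in which these options are tried can be pictured as a simple fallback chain. The following is a minimal Java illustration, not part of any OAIS tooling: the `ExtractionOption` interface, the option names and the stubbed results are all hypothetical, and each real option would of course involve far more work than returning a canned value.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Illustrative sketch: every name here is hypothetical. */
final class ExtractionFallback {

    /** One way of getting a value out of the digital object. */
    interface ExtractionOption {
        String name();

        /** Returns the extracted value, or empty if this option is not usable. */
        Optional<Double> tryExtract(int x, int y);
    }

    /** Convenience factory for the stub options used in the example. */
    static ExtractionOption option(String name, Optional<Double> result) {
        return new ExtractionOption() {
            @Override public String name() { return name; }
            @Override public Optional<Double> tryExtract(int x, int y) { return result; }
        };
    }

    /** Tries each option in order and returns the name of the first one that works. */
    static String firstUsable(List<ExtractionOption> options, int x, int y) {
        for (ExtractionOption o : options) {
            if (o.tryExtract(x, y).isPresent()) {
                return o.name();
            }
        }
        return "none";
    }

    public static void main(String[] args) {
        List<ExtractionOption> options = Arrays.asList(
            option("run Java application", Optional.empty()),      // e.g. no suitable JVM
            option("port source code", Optional.empty()),          // e.g. source unavailable
            option("generic DDL application", Optional.of(42.0)),  // stubbed value
            option("hand-written reader from DDL", Optional.of(42.0)));
        System.out.println(firstUsable(options, 10, 20));
    }
}
```

The design point is only that the Representation Information Network tells the user which options exist, so the search is over a known list rather than guesswork.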

9.2 Use of Existing Software


Option (A) above is an example of using existing software, albeit probably old software. A more interesting example is the case where one wants to use information from this digital object with one's current favourite software. This may be because of the additional functionality which that favourite software provides. The additional functionality could include being able to combine that data with other data more easily. Again one can imagine that this other software may be associated with (e.g. in the Representation Information Network of) other archived data, or it may be more modern software; the argument applies equally. Once again one can imagine several ways of doing this, and these are described next.

9.2.1 Migration/Transformation
Migration, or more precisely Transformation (using OAIS terminology), involves changing the bit sequences from the original to something else. Following the recent revision of OAIS one can recognise that if this transformation is reversible then one can be confident that no information has been lost. On the other hand, non-reversible transformations have probably lost information, and someone must take responsibility for confirming that the transformation adequately maintains the important information. This is discussed in much more detail in Sect. 13.6.

For those with an eye for recursion, the ways in which the transformation could be carried out are special cases of this sub-section, namely using a single digital object. For example, one can use existing software, the subject of this sub-section, if there is software which can take in the original bit sequences in order to perform the transformation. One could alternatively use a data description language (DDL) description to extract values from the original and write them out as the new bit sequences. This could be done using generic applications, as illustrated in Fig. 9.3, or else could be hand-crafted. The transformation chosen will of course be one which produces something which can be used by the software which has been chosen to deal with the information in the digitally encoded information.

Fig. 9.3 Using a generic application to transform from one encoding to another
[Diagram labels: original digital object plus data description; data description for transformed digital object; generic transformation software; transformed digital object.]

Authenticity evidence should of course be provided by someone, providing values and other information about selected Transformational Information Properties (also known as Significant Properties), as discussed in Sect. 13.6.
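Reversibility can be tested mechanically on sample objects: apply the transformation, apply its inverse, and compare the result with the original bit sequence. The Java sketch below assumes, purely for illustration, an original encoding of big-endian 32-bit integers and a transformed encoding of comma-separated decimal text. Neither encoding nor any of the names comes from the text, and a successful check on samples is evidence of reversibility rather than proof.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Illustrative sketch: the two encodings and all names are assumptions. */
final class RoundTripCheck {

    /** Forward transformation: big-endian 32-bit integers to comma-separated text. */
    static byte[] toText(byte[] binary) {
        ByteBuffer in = ByteBuffer.wrap(binary); // ByteBuffer is big-endian by default
        StringBuilder sb = new StringBuilder();
        while (in.remaining() >= 4) {
            if (sb.length() > 0) {
                sb.append(',');
            }
            sb.append(in.getInt());
        }
        // Any trailing bytes (fewer than 4) are silently dropped here,
        // which is exactly the kind of loss the round trip exposes.
        return sb.toString().getBytes(StandardCharsets.US_ASCII);
    }

    /** Inverse transformation: comma-separated text to big-endian 32-bit integers. */
    static byte[] toBinary(byte[] text) {
        String s = new String(text, StandardCharsets.US_ASCII);
        if (s.isEmpty()) {
            return new byte[0];
        }
        String[] parts = s.split(",");
        ByteBuffer out = ByteBuffer.allocate(parts.length * 4);
        for (String p : parts) {
            out.putInt(Integer.parseInt(p));
        }
        return out.array();
    }

    /** True if transforming and then inverting reproduces the original bits exactly. */
    static boolean isReversibleFor(byte[] original) {
        return Arrays.equals(original, toBinary(toText(original)));
    }

    public static void main(String[] args) {
        byte[] wholeInts = ByteBuffer.allocate(8).putInt(7).putInt(-3).array();
        byte[] withStrayByte = {0, 0, 0, 1, 9};
        System.out.println(isReversibleFor(wholeInts));
        System.out.println(isReversibleFor(withStrayByte));
    }
}
```

When the check fails, as it does for the input with a stray trailing byte, the transformation falls into the non-reversible case, where someone has to confirm that what was lost did not matter.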

9.2.2 Interfacing
A related but alternative way of using the digital object in one's preferred software is to use or create an appropriate programming interface. Whether or not this is possible depends upon the flexibility of that preferred software, for example whether or not it is possible to use plug-ins. Instead of transforming the digital object as a whole, one essentially does it on the fly, treating only the piece that is needed. The advantage is that one might be dealing with an object of many gigabytes or perhaps, in the case of scientific information, many terabytes (1 terabyte = 1,024 gigabytes) or even more. If one is only interested in a small part of the information then transforming the whole digital object may be a waste of effort. Being able to transform only the part that is needed can be a great saving in computation time and temporary disk storage in such circumstances. If a large number of such objects are to be dealt with, the cumulative savings could offset the effort needed to create the programming interface. With luck this may be done automatically; the alternative is to do it manually.

9.2.2.1 Manual Interfaces

The manual option may be described using the data shown in Sect. 19 as an example. That data is essentially tabular. The EAST description allows one to extract individual values. It is in principle fairly easy to implement the following Java methods:

    public int getRowCount();
    public int getColumnCount();
    public Object getValueAt(int row, int column);

in order to extend the AbstractTableModel class [71]. If this is done then many Java applications are available to manipulate or display the data (see Sect. 7.8.2.1.2).

9.2.2.2 Automated

The automated option is the most convenient but is not often available. Essentially the manual steps above are carried out automatically. Whether or not this is possible depends, for example, on the amount and type of Representation Information available and the tools which can use them.
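A manual interface of the kind described in Sect. 9.2.2.1 might look like the following Java sketch, which extends `javax.swing.table.AbstractTableModel` and decodes a cell only when it is requested. The binary layout used here (row-major, big-endian 64-bit floating point values with a fixed number of columns) is an assumption standing in for whatever the EAST description of the real data specifies, and the class name `BinaryTableModel` is hypothetical.

```java
import java.nio.ByteBuffer;
import javax.swing.table.AbstractTableModel;

/**
 * Illustrative sketch. The layout (row-major, big-endian 64-bit floats) is an
 * assumption, not the layout of the data referred to in the text.
 */
final class BinaryTableModel extends AbstractTableModel {
    private final ByteBuffer data;
    private final int columns;

    BinaryTableModel(ByteBuffer data, int columns) {
        if (columns <= 0 || data.remaining() % (columns * Double.BYTES) != 0) {
            throw new IllegalArgumentException("buffer does not hold whole rows");
        }
        this.data = data.slice(); // independent position, shared content, big-endian
        this.columns = columns;
    }

    @Override
    public int getRowCount() {
        return data.capacity() / (columns * Double.BYTES);
    }

    @Override
    public int getColumnCount() {
        return columns;
    }

    @Override
    public Object getValueAt(int row, int column) {
        if (row < 0 || row >= getRowCount() || column < 0 || column >= columns) {
            throw new IndexOutOfBoundsException("no such cell: " + row + "," + column);
        }
        // Absolute read: only the 8 bytes of the requested cell are decoded.
        return data.getDouble((row * columns + column) * Double.BYTES);
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(2 * 3 * Double.BYTES);
        for (int i = 0; i < 6; i++) {
            buf.putDouble(i * 1.5);
        }
        buf.flip(); // make the written bytes readable from position 0
        BinaryTableModel model = new BinaryTableModel(buf, 3);
        System.out.println(model.getRowCount() + " x " + model.getColumnCount());
        System.out.println(model.getValueAt(1, 2));
    }
}
```

Because values are decoded on demand, the same class could be backed by a memory-mapped buffer over a very large file, so that nothing is transformed until a display component such as a `JTable` actually asks for it.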

9.3 Creation of New Software


Entirely new software may be needed in order to deal adequately with the digitally encoded information. Techniques described in the previous sections to extract information from the digital object are applicable here. The difference is that one needs to design and implement the rest of the application, rather than having one already available. Of course what the software does is dependent on one's imagination and the requirements.

9.4 Without Software


Software is not always needed, as illustrated by the data at the start of this chapter, where one can imagine drawing each of the pixels by hand on squared graph paper. Pencil and paper may be all that is needed, although clearly this would only be practical for small amounts of data.


9.5 Software as the Digital Object Being Preserved


When software itself is the digital object being preserved, all the above applies. However, there are some additional considerations, because to do some of what is described in the previous sections could be very complex. This is because the software which uses a software digital object is an operating system or virtual machine. The options discussed above become:

A. If (s)he does not have the correct version of the operating system at hand then (s)he at least has the option of trying to obtain it from a Registry/Repository of Representation Information, because (s)he knows what is needed.
   a. An important variant of this is the use of operating systems running in emulators, described in Sect. 7.9.
B. If the application cannot be run then it might be possible to take the source code, if available, and port it to an available operating system or convert it to another programming language.

The remaining option, of using a data description language, is not an easy one. An example of this could be a Java application, where we could argue that Java byte code is well described; using that description would require re-implementing a Java Virtual Machine, which is quite a daunting task. The testbed example in Sect. 19 provides further examples of using the Representation Information Network for software.

9.6 Digital Archaeology, Digital Forensics and Re-Use


The above starts from the assumption that Representation Information is available, as should be the case where digitally encoded information is being adequately preserved. There are times when one is not in such a fortunate position, for example where one finds some digital data but does not know much about it. In such a case one may be able to find the format (i.e. the appropriate Structure Representation Information), as discussed in Sect. 7.4. What will be much more difficult to do is to find the semantics associated with it. For example, one may be able to discover that a file is a PDF. This allows one to render the contents of the file. This does not mean that one understands or can use the information it contains; for example, the rendered text might contain a string of 1s and 0s, as described at the start of this chapter, or it might be in some unknown language. In some cases this has not been an insuperable problem (an analogy may be drawn with the interpretation of cuneiform), but it can take a considerable amount of time and effort. Therefore this is a method of last resort.
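Identifying the format is often a matter of inspecting the first few bytes of the file. As a hedged illustration: PDF files begin with the ASCII characters `%PDF-`, and FITS files begin with a header card whose keyword is `SIMPLE`, padded to eight characters and followed by `=`. The Java sketch below checks only these two signatures. The class and method names are hypothetical, and a real identification tool would consult a much larger registry of signatures.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative sketch: names are hypothetical and only two formats are known. */
final class FormatSniffer {

    /** Guesses a format name from the leading bytes of a file. */
    static String guessFormat(byte[] head) {
        // Decode at most the first 9 bytes; ISO-8859-1 maps every byte to a character.
        String start = new String(head, 0, Math.min(head.length, 9), StandardCharsets.ISO_8859_1);
        if (start.startsWith("%PDF-")) {
            return "PDF";
        }
        if (start.startsWith("SIMPLE  =")) {
            return "FITS";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        byte[] pdfLike = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        byte[] fitsLike = "SIMPLE  =                    T".getBytes(StandardCharsets.US_ASCII);
        System.out.println(guessFormat(pdfLike));
        System.out.println(guessFormat(fitsLike));
        System.out.println(guessFormat(new byte[] {1, 2, 3}));
    }
}
```

A positive result recovers only Structure Representation Information. It says nothing about what the rendered content means, which is the harder problem described above.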


9.7 Multiple Objects


Dealing with multiple pieces of digitally encoded information introduces more complexity, but the essential concepts have been covered; therefore no more will be said.

9.8 Summary
Although not providing all the details, it is hoped that this chapter will have provided the reader with an understanding of how digital objects may be used and re-used over the long term. Examples of some of these are provided in Part II. It may not be a trivial process but, if the right Representation Information has been collected, then at least it should be possible. It should also be clear that the formal description techniques offer the possibility of making re-use easier for future users.
