2.1 – File formats
Aid information can be published in a wide variety of file formats – from word processor document formats to database file formats. The choice of file format can affect how easy it is to re-use the information encoded in it.
2.1.1 – Proprietary, non-proprietary and open file formats
On the one hand some file formats are non-proprietary and open, which means they can be used or implemented by anyone with little or no restriction. Prominent examples include HTML/XHTML, OpenDocument, PDF, TXT, XML 1.
On the other hand some file formats are proprietary, which means that there may be restrictions on how the format may be used, and certain software packages may be required to read the files. Prominent examples include MPEG Audio Layer 3 (MP3), Windows Media Video (WMV), Microsoft Word (DOC/DOCX) and Microsoft Excel (XLS/XLSX) 2.
2.1.2 – Machine readability
While some file formats present data in a way which is ‘machine-readable’, some file formats are primarily meant to be read by people.
For example, a table of financial information might be published in XML, Comma Separated Value (CSV) or Microsoft Excel (XLS) formats, which can be easily graphed, analysed, aggregated with other data or converted into other formats – as the rows and columns can be read by the computer. Alternatively, the same table could be published in a PDF or Microsoft Word document. In this case the material in the tables would have to be extracted – either by hand or by using a computer program (commonly called ‘screen scraping’). The data may also be broken up into multiple tables and scattered throughout a document with explanatory notes – in which case the underlying data would have to be extracted and pieced back together.
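To illustrate what ‘machine-readable’ means in practice, here is a minimal Python sketch that aggregates a small CSV table using only the standard library. The column names (`recipient`, `sector`, `amount_usd`) and the figures are invented for illustration and do not come from any real dataset.

```python
import csv
import io
from collections import defaultdict

# Hypothetical aid data: the column names and amounts are made up for this example.
CSV_TEXT = """recipient,sector,amount_usd
Kenya,Health,1200000
Kenya,Education,800000
Ghana,Health,500000
"""


def total_by_recipient(csv_text):
    """Sum the amount_usd column for each recipient."""
    totals = defaultdict(int)
    # DictReader uses the header row to name each column.
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["recipient"]] += int(row["amount_usd"])
    return dict(totals)


totals = total_by_recipient(CSV_TEXT)
print(totals)  # {'Kenya': 2000000, 'Ghana': 500000}
```

Because the rows and columns are explicit in the file, the aggregation takes a few lines. The same table embedded in a PDF would first need its cells to be located and extracted before any such step could run.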
It is worth noting that whether or not a file format is open is a separate issue from whether or not it is machine readable. For example, PDF files are open – but the file format is mainly oriented towards printing and layout, not towards having its contents extracted, revised and/or re-used after publication. Excel files are proprietary, but they are machine processable and are much more useful than PDF files when it comes to analysing, visualising or linking together their contents.
The following table lists common file formats for text and data, along with details on whether or not they are machine readable, whether the specification is available and whether or not they are open 3:
| File format | Machine readable? | Specification available? | Open? |
|---|---|---|---|
| Plain Text (.txt) | ✔ | ✔ | ✔ |
| Comma Separated Value (.csv/.txt) | ✔ | ✔ ** | ✔ |
| Hyper Text Markup Language (.html/.htm) | ✔ | ✔ | ✔ |
| Extensible Markup Language (.xml) | ✔ | ✔ | ✔ |
| Resource Description Framework (.rdf) | ✔ | ✔ | ✔ |
| Open Document Format (.odt, .ods, etc) | ✔ | ✔ | ✔ |
| Portable Document Format (.pdf) | ✘ | ✔ | ✔ |
| Microsoft Word (.doc/.docx) | ✘ | ✔ | ✘ |
| Microsoft Excel (.xls/.xlsx) | ✔ | ✔ | ✘ |
** Though there is not an official standard specification for CSV, many informal specification documents exist.
2.1.3 – Formats for the semantic web
There are a variety of formats designed to make it possible for computers to analyse the contents of a file. For example, Resource Description Framework (RDF) allows resources to be described in such a way that computers can sort and query their contents.
So while an encyclopedia might contain an ordinary sentence such as “Paris is the capital of France”, which means nothing to a computer, an equivalent statement in RDF could express that ‘Paris’ is the name of a capital city, ‘France’ is the name of a country, and that the first ‘is a capital of’ the second. Hence, RDF allows the creation of structured relationships between entities that computers can parse and query – rather than unstructured text that the computer can do much less with. As a concrete example, projects such as DBpedia attempt to extract structured information from Wikipedia to allow users to make sophisticated queries such as:
[...] soccer players with number 11 (on their jersey), who play in a club whose stadium has a capacity of more than 40000 people and were born in a country with more than 10 million inhabitants. 4
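The idea of statements that a computer can query can be sketched in a few lines of Python. This is not real RDF syntax and does not use an RDF library: it models each statement as a plain (subject, predicate, object) tuple, which is the basic shape of an RDF statement. The labels (`is_a`, `is_capital_of`, `CapitalCity`) and the helper functions `match` and `capital_of` are hypothetical, and the Berlin/Germany entries are added only so the query has something to filter out.

```python
# Each statement is a (subject, predicate, object) triple.
# The labels are illustrative only; real RDF would use URIs from a vocabulary.
TRIPLES = [
    ("Paris", "is_a", "CapitalCity"),
    ("France", "is_a", "Country"),
    ("Paris", "is_capital_of", "France"),
    ("Berlin", "is_a", "CapitalCity"),
    ("Germany", "is_a", "Country"),
    ("Berlin", "is_capital_of", "Germany"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


def capital_of(triples, country):
    """Return the subject of the first 'is_capital_of' triple for a country, if any."""
    hits = match(triples, predicate="is_capital_of", obj=country)
    return hits[0][0] if hits else None


print(capital_of(TRIPLES, "France"))  # Paris
```

The plain sentence “Paris is the capital of France” would need natural-language processing before a program could answer the same question. With explicit triples, the answer is a simple pattern match.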
2.1.4 – Which formats?
It may be difficult to know which formats will be most useful in the long term – and hence it may be undesirable to prescribe a single format which may be popular or widespread at one point, but superseded in the future. Prescribing a single format may also incur unnecessary costs for organisations publishing the material, and could require expertise that is unavailable.
When it comes to re-using aid information, it is crucial that the data is machine readable and that there are no technical obstacles to re-using it. If it is under an open license, then others can republish the same material in different formats. If the file format specification is publicly available, then there is less risk that prospective re-users will be required to use a particular piece of software, or that, in the worst-case scenario, the format will become obsolete and unreadable without software that is no longer supported.
While publishing aid information in formats such as RDF and JSON would be desirable, not all organisations will have the expertise to do this. Hence we would not recommend that the information publishing standard should require it.
We suggest that the standard should require that aid information is published in one of a small number of basic formats for text and data such as TXT and CSV, and recommend (optionally) that it is also published in other more recent formats such as RDF, JSON and so on. Crucially it should require that all formats are machine readable.
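One reason a basic machine-readable format can serve as the required baseline is that other formats can be derived from it mechanically. The following Python sketch, using only the standard library, converts a small CSV table into JSON. The column names and values are invented for illustration, and `csv_to_json` is a hypothetical helper, not part of any proposed standard.

```python
import csv
import io
import json

# Hypothetical project records: identifiers and amounts are made up for this example.
CSV_TEXT = """project_id,recipient,amount_usd
P-001,Kenya,1200000
P-002,Ghana,500000
"""


def csv_to_json(csv_text):
    """Convert CSV text into a JSON array with one object per row."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # CSV carries no type information, so every value stays a string here.
    return json.dumps(rows, indent=2)


json_text = csv_to_json(CSV_TEXT)
print(json_text)
```

A publisher without the expertise to produce JSON or RDF directly could still publish CSV, and a re-user could run a conversion like this themselves, provided the license permits republishing.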
2. Both Word and Excel file formats are included in the Microsoft Open Specification Promise, published in September 2006, which is a promise not to assert legal rights over certain formats.
3. This is based on a registry of formats from the Information Accessibility Initiative.
4. Sören Auer, Jens Lehmann, What have Innsbruck and Leipzig in common? Extracting Semantics from Wiki Content (PDF), p.11. In Franconi et al. (eds), Proceedings of 4th European Semantic Web Conference, ESWC 2007, Innsbruck, Austria, June 3-7, 2007, LNCS 4519, pp. 503–517, ISBN 978-3-540-72666-1, Springer, 2007.