Revisiting archival description – LOD-LAM session idea
Posted on May 30, 2011
Apologies for the brevity of this post – I want to make sure it goes up before LOD-LAM.
So, archival description.
Archival records are hard to find. They’re often buried in large bodies of records, difficult to browse through, and generally less cut-and-dried than publications, which are intended for public consumption. Archival finding aids are the researcher’s traditional first point of contact, providing background biographical information on the organizational and/or personal creator(s), as well as a description of how the records are arranged and descriptions of the various levels of the organizational hierarchy. They’re useful!
But they’re also a bit old-fashioned, at least as typically implemented. The finding aid structure poses a few problems for linked open data applications.
I see two[1] major problems with current archival description:
- They’re hierarchical
Most countries’ archival description standards are based on a strict hierarchy from higher levels of description (fonds, etc.) to more precise levels of description (series, sub-series, file, item) with fairly rigidly prescribed relationships between items. The finding aid also assumes a “paper” whole-body approach, rather than a linking approach. This is kind of non-webby, and imposes a stricter order on documents than their creators may have had, in many cases.
(The Australians, of course, are a few steps ahead of the rest of us already.)
Perhaps an even bigger problem, though, is that:
- They’re imprecise.
This is the real issue, or at least the most immediate issue. Archival descriptions are designed for human eyes in a paper world, and so they’re often written with a level of ambiguity that makes the information in them difficult for machines to extract. (LOCAH has been doing a great job of identifying points of concern and trying to route around them.)
Archival descriptions have some inherent ambiguity because interpretation of archival holdings is not always cut and dried, but that doesn’t mean that we have to be ambiguous in how we create those descriptions. We can be precise about the ways in which our collections are ambiguous.
I’d love to get a conversation going about revising descriptive standards to make finding aids more precise, so that they can be used more readily as computer-readable metadata. I can see a number of areas for improvement:
- More strongly-typed data fields, rather than “fuzzy” fields that can hold a variety of types of subjectively-defined data
- More focus on “globally-scoped” names rather than “locally scoped” (as pointed out by Pete@LOCAH here)
- A stricter, clearer inheritance model rather than ISAD(G)’s rule of non-repetition (Thanks to Pete again)
- Certainly more, which we can talk about at LOD-LAM!
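As a rough sketch of what the first three points might look like in practice, here is a small Python model. Everything in it is hypothetical: the class names, the `certain` flag and the example.org URI are assumptions made for illustration, not part of ISAD(G) or any other descriptive standard. It shows a typed date field that records its own uncertainty, a creator identified by a globally-scoped URI, and an explicit rule for inheriting values from higher levels of description.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class DateRange:
    """A typed date span. `certain=False` states explicitly that the dates are estimated."""
    start: date
    end: date
    certain: bool = True


@dataclass
class Description:
    """One level of description (fonds, series, file, item). Hypothetical model."""
    level: str
    title: str
    creator_uri: Optional[str] = None      # globally-scoped name, assumed to be a URI
    dates: Optional[DateRange] = None
    parent: Optional["Description"] = None

    def resolved(self, attr: str):
        """Return this level's value, or the nearest ancestor's value if unset here."""
        node = self
        while node is not None:
            value = getattr(node, attr)
            if value is not None:
                return value
            node = node.parent
        return None


# Illustrative data only: the fonds carries the creator and estimated dates.
fonds = Description(
    level="fonds",
    title="Example family fonds",
    creator_uri="http://example.org/agents/example-family",
    dates=DateRange(date(1900, 1, 1), date(1950, 12, 31), certain=False),
)
series = Description(level="series", title="Correspondence", parent=fonds)

# The series states nothing about its creator, so the value is inherited.
print(series.resolved("creator_uri"))
print(series.resolved("dates").certain)
```

The point of `resolved` is that a machine reading the series record never has to guess whether a missing creator means "unknown" or "same as above": the inheritance rule is written down once and applied the same way everywhere.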
The extent to which all this can be implemented will depend on the organization, of course – retrofitting older archival descriptions for all of this would be time-consuming, if practical at all. But I think there are a lot of benefits to be gained by changing practices going forward, and I see this as an enhancement to current descriptive standards/practices that can benefit more than just linked open data applications.
[1] Probably more, but for now I’ll focus on these two.
» Filed Under Description, LOD-LAM, Metadata
Sigma SD1 – update
Posted on May 30, 2011
Heartbreak :’(
http://dpreview.com/news/1105/11052010sigmasd1.asp
» Filed Under Cameras, Digitization, Scanner building
Hiatus
Posted on March 27, 2011
Apologies for the long hiatus. I didn’t announce it on the blog, but I began a new job in January and moved to another province to take it.
I plan to get back to posting in the next bit. It will be a while until things get set up, but I promise I’ll have cool stuff to share – and I’ll get back to posting digitization tutorials in the near future as well.
» Filed Under Uncategorized
Tech to watch out for – Sigma SD1
Posted on October 18, 2010
Photokina, the world’s biggest photography exhibition, took place last month and I was eagerly scanning the headlines for good book scanning news. It may not have been the biggest news of the show for photography buffs, but I was very excited to see the announcement (via DPReview) of Sigma’s new SD1 camera. I’ve been holding out posting in hopes that more concrete details or sample images would show up, but since it looks like there may not be any news for a few months I decided to go ahead.
The SD1 is the newest Sigma camera to use Foveon technology, which has always been interesting but never seemed quite ready until now. The SD1 is the first Foveon camera that looks competitive with traditional cameras, and it could mean substantially better colour reproduction than is possible right now.
Why this could be so good takes a little explanation. A computer monitor produces colour by mixing together primary colours; every pixel on your monitor has three lights, representing red, green and blue (RGB). Essentially all cameras, on the other hand, use something called a “Bayer array” on their sensor. Instead of capturing all three primary colours at every pixel, the way a computer works with them, they capture a single colour (red, green or blue) at each pixel. When the data is processed to produce the photo, the two missing colours at each pixel are interpolated from the adjacent pixels to produce a full set of RGB values for each pixel. This means that the image has full lightness, or luma, resolution but a lower colour, or chroma, resolution, because each pixel has only one true colour value.

Bayer sensor pixels. Each pixel is coloured to show the colour it detects. Illustration from Wikipedia.
Foveon works in a very different way. Instead of laying pixels out side by side on a single layer, it uses three stacked layers, each sensitive to one primary. The “stacks” of pixels are associated with each other in the same way that the three colour sub-pixels in a computer monitor pixel make up one single pixel, which means that every pixel has three true colour values and a chroma resolution equal to its luma resolution.
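The difference between the two designs can be sketched by counting how many colour values are actually measured, rather than interpolated, on a small grid of pixels. This is only an illustration: the function names are made up, and the red-green/green-blue tile order is an assumed layout (real sensors vary in where the pattern starts).

```python
from collections import Counter


def bayer_filter(row, col):
    """Colour measured at (row, col), assuming a repeating R G / G B tile."""
    if row % 2 == 0:
        return "R" if col % 2 == 0 else "G"
    return "G" if col % 2 == 0 else "B"


def bayer_samples(width, height):
    """Count the measured samples per colour on a Bayer-style sensor."""
    return Counter(
        bayer_filter(row, col)
        for row in range(height)
        for col in range(width)
    )


def stacked_samples(width, height):
    """Count the measured samples per colour when every pixel records all three."""
    pixels = width * height
    return Counter({"R": pixels, "G": pixels, "B": pixels})


print(dict(bayer_samples(4, 4)))
print(dict(stacked_samples(4, 4)))
```

On a 4 x 4 grid the Bayer layout measures red at 4 pixels, green at 8 and blue at 4, with everything else filled in from neighbours, while the stacked layout measures all three colours at all 16 pixels.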
The Sigma SD1 marks the first time that Sigma’s cameras really have the chance to wow people. Their previous highest resolution sensors have been 4.7 megapixels, which meant that the theoretical advantage of the technology was a bit moot in the face of the competition’s raw number of pixels. The SD1, on the other hand, has a 15.4 megapixel sensor, which means that it has the potential to compete on detail with the best cameras currently available if it delivers on Sigma’s promises.
My understanding is that Foveon-based technology has been used in industrial applications for some time now, but a high-resolution Foveon sensor for consumer use would mark the first time the technology is usable for those interested in building book scanners. If colour accuracy isn’t compromised, this is very promising. I’m keeping an eye open, and I’ll report on any updates.
» Filed Under Cameras, Digitization
Understanding DPI
Posted on October 18, 2010
One question that comes up a lot on the DIY Book Scanner forums is how to calculate DPI for documents scanned using a camera, and a lot of people struggle trying to correlate megapixels on the cameras they’re buying to the resolution they can expect to get in their final images. I’ve put together a short guide to help make sense of it.
DPI, or dots per inch, is the way that resolution is usually measured for archival images. You’ve probably heard of standards like 300 DPI or 600 DPI, but what do those mean? Instead of describing the resolution of images in terms of the raw number of pixels, DPI represents resolution in terms of the number of pixels relative to the size of the original physical item. In other words, 300 DPI means that each linear inch of the original item is represented by 300 pixels, so a square inch is covered by 300 x 300 = 90,000 pixels. The usual archival standard for imaging is 300 DPI or 600 DPI; 300 DPI is sufficient to reproduce the item at its original size at a high quality, while 600 is sufficient for enlargements. Very few items have more than 600 DPI’s worth of detail.
This is where camera scanners are different from traditional flatbed scanners. A flatbed scanner advertises a specific DPI, while limiting the size of the items that can be scanned to what can fit on the glass. Cameras, on the other hand, provide just a number of pixels with no limitation on the size of the item. The DPI depends as much on the scanning cradle and other equipment as it does on the camera.
Estimating the DPI a camera can theoretically provide is easy if you know the resolution it captures, however. Dividing the number of horizontal pixels the camera provides by the length of the item along the same edge gives the maximum theoretical resolution for that item. For instance, a 12 megapixel camera whose images are 4032 pixels wide can provide about 366 DPI when scanning an 8.5 x 11″ sheet of paper (4032 pixels divided by 11 inches). However, keep in mind that there will always be at least a small border of empty space around the page, so you will never quite achieve that maximum resolution.
Measuring the resolution in a captured image works in basically the same way. Say that in an image from the same camera, the 8.5 x 11″ sheet of paper takes up 3000 of the horizontal pixels. The actual resolution would be about 272 DPI, because 3000 pixels divided by 11 inches is roughly 272.7.
You can see how it works in this sample image.
This photo shows a 6 inch by 9 inch book page. This preview is scaled down, but the real page takes up 2552 pixels – divided by 9 inches, that means a DPI of 283.
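The arithmetic above fits in a few lines of Python. This is just a sketch with made-up function names; `pixels_needed` runs the same division in reverse, to estimate how many pixels an item has to span to reach a target DPI. The figures quoted in the text are these results rounded down to a whole number.

```python
import math


def dpi(pixels, inches):
    """Resolution in pixels per linear inch along one edge of the item."""
    return pixels / inches


def pixels_needed(inches, target_dpi):
    """Pixels the item must span along one edge to reach target_dpi."""
    return math.ceil(inches * target_dpi)


# Theoretical maximum: the 11" edge fills the full 4032-pixel width.
print(round(dpi(4032, 11), 1))

# Measured: the same sheet spans only 3000 pixels in the captured image.
print(round(dpi(3000, 11), 1))

# The sample book page: 9" spanning 2552 pixels.
print(round(dpi(2552, 9), 1))

# Working backwards: pixels required along the 11" edge for 300 DPI.
print(pixels_needed(11, 300))
```

The last line shows that an 11″ edge needs 3300 pixels to reach 300 DPI, which is the kind of number worth having in hand before comparing cameras.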
One last piece of advice. When going out to buy a camera, make sure that you have a good idea of the sizes of items that you will be scanning. Nothing can tell you what equipment you need better than figuring out your needs first!
» Filed Under Cameras, Digitization, Scanner building
Camera shopping guide
Posted on September 3, 2010
As promised in my last post about book and document scanner building, I’ve put together a little guide to choosing a camera. The camera is really the most important and expensive part of a DIY book scanner, and of all the components you have to choose, it has the greatest effect on image quality. The guides I’ve seen on camera selection are mainly aimed at home users looking to scan their own books to use on portable devices, so I’m writing this guide primarily for archival scanning.
This post is going to be very dense in information, but I’ve included an easy-to-read summary at the bottom.
Inexpensive point-and-shoot cameras – $120 and under
The most popular cameras at the DIY Book Scanner forums are relatively low-end Canon PowerShot cameras which support special software called CHDK. The CHDK software enables additional features, including remote control, raw file format, and advanced programming functions. These cameras are perfect for that site’s audience of hobbyists, but not necessarily for archivists. They can capture pure text pages in sufficient quality for OCR, but their image quality is not high enough to produce a faithful reproduction of the original page. The most popular of these cameras was traditionally the 8 megapixel PowerShot A590, discontinued but widely available for $120 and less, while more recently the 10 megapixel PowerShot A480 has also been popular.
Low-end SLR-style cameras – $500 – $700
The next step up from compact cameras is the mirrorless category: cameras which provide image quality comparable to professional SLR cameras. These are generally equipped with all of the features necessary for book scanning, although they often have fewer capabilities for remote control and computer programmability than Canon and Nikon SLRs. The two major competitors in this field at the moment are the Micro Four Thirds standard and Sony’s NEX.
The best options for book scanning are the Olympus PEN E-PL1 (12mp, $600), Olympus PEN E-P1 (12mp, $900 list but usually less), Panasonic DMC-G10 (12mp, $700) and the Sony NEX-3 (14mp, $700). The recent launch of Sony’s NEX-3 and NEX-5 has resulted in intense price competition in this field, so many of these cameras can be bought for significantly less than these list prices – look for cameras like the E-P1, E-PL1 and NEX-3 at close to $500, depending on when you’re shopping. Note however that the NEX-3/5 and E-PL1 do not support remote control.
I’ve differentiated this from the mid-range category below, but there’s actually some overlap in pricing with SLRs for the time being while the market sorts out where exactly mirrorless cameras sit.
Mid-range SLRs – $650 – $1000, with deals down to $500
SLR cameras have traditionally been the most popular document scanning cameras, both because of their image quality and because Canon and Nikon cameras are remotely controllable and programmable using computers. In the mid-range price point, the best buys for document photography are the Canon EOS Rebel T2i (aka 550D in Europe), which provides 18mp for a list price of $1000, and the recently-announced Nikon D3100, which provides 14.2mp for a fairly thrifty list price of $700. Depending on your needs, it’s also worth keeping an eye out for last generation’s models – the Canon T1i/500D (15mp) and Nikon D3000 (10mp) can be bought for close to $500 in some places as inventory is cleared out, which makes them possibly the best buy at this point in time. Canon also continues to offer their 2008 model Rebel XS (10mp, $580), which is unexciting but cheap.
At this point I’m skipping several price tiers in SLR offerings. After the entry-level offerings, most SLRs are differentiated not by their image quality but by other features that have no real bearing for document photography. Canon’s T2i, for example, uses an identical sensor with identical imaging performance to the Canon EOS 7D despite a price difference of nearly a thousand dollars. This comparison at DXOMark shows that both cameras have essentially identical scores.
Full-frame SLRs – $3600 and up
The biggest price jump at this point is to “full-frame” SLRs, which have significantly larger sensors than the cameras in the previous categories. This gives them better imaging quality and a higher resolution. Canon’s best offering for document scanning is probably the 5D mk II (21mp, $3600 with lens). Again, while Canon offers a more expensive full-frame SLR (the 1Ds mk III, 21mp at $7800 without lens), the two cameras have essentially identical imaging quality. Nikon’s equivalent is the D3X, which provides 24.5mp at a somewhat heart-stopping price of $8100 without a lens.
Summary
Right now, the best buys available are probably the Canon T1i, while it lasts, the Nikon D3000, and, depending on requirements, potentially the Sony NEX-3. Sorted by list price, here are the cameras mentioned above:
CHDK-compatible PowerShot cameras – $120 and less
Nikon D3000 – $500
Canon XS – $580
Olympus PEN E-PL1 – $600
Nikon D3100 – $700
Sony NEX-3 – $700
Panasonic DMC-G10 – $700
Canon T1i – $800
Olympus E-P1 – $900
Canon T2i – $1000
Canon 5D mk II – $3600
Nikon D3X – $8100+
Footnote:
Notably absent from the list is the Canon PowerShot G10 whose praises I’ve been singing in the past. It’s an excellent camera, but something of an anachronism. It’s a power user’s compact camera, which is a market that’s less relevant than it used to be. While it was competitive at the time I bought it, its successors and its competitors are no longer of much interest for document scanning when compared to the options that are available now.
» Filed Under Cameras, Digitization, Scanner building
Scanning rare, fragile militia handbooks
Posted on August 16, 2010
As an archivist, the most important thing for me when scanning materials is ensuring the safety of the original record. I never digitize anything if I doubt that it can be scanned without damage.
That put me in a bit of a tough spot recently when the library was loaned a set of rare, valuable militia handbooks by Richard Shaver from his personal collection. The handbooks are very valuable information sources that aren’t, to my knowledge, available in digital form, so I was eager to add them to the library’s Digital Collections. Unfortunately, the tiny handbooks, which measure about four inches tall each, are very tightly bound, and their spines are fragile at this point. I knew that there was no way the books could be safely used on the usual book scanner I put together, since the stress would permanently damage the spines.
Luckily, however, I was able to get a backup plan together. While the spine mobility is limited, the books can be safely opened to 90 degrees and that’s enough to get a clear image of the page with a different camera orientation. Mark Monson recently donated a book scanner to the library based on a design he posted at the DIY Book Scanner forums – familiar to anyone who saw my TAATU presentation in June. His scanner mounts the camera overhead above the book, and uses a pneumatic foot pedal to trigger the camera, leaving the operator free to use both hands to secure the book. I tweaked the design slightly to adjust for book position relative to the camera and to provide a higher-contrast background, and the results have been excellent. The scans are very clear and extremely sharp. They’re also over-the-top in resolution thanks to the books’ small size – I think 950 DPI sets a new record in my work at the library.
I’m very happy that I was able to scan these books. I always err on the side of caution – but that doesn’t stop me from looking for new ways to digitize materials when I hit a roadblock. The reward of being able to make these handbooks available online was well worth the work.
For those interested, I’ve taken a short video of the scanner in motion to give an idea of the scanning process. I’ve also included a sample page from the first book scanned at the end of the post. The books are not currently publicly available on the Digital Collections site, but will be when the Burford and Oakland Township collections launch later this year.
» Filed Under Digitization, Scanner building
What’s coming up, and what would you like to see?
Posted on August 15, 2010
First off, a big welcome to new readers following here from Digitization 101 – and a big thank you to Jill Hurst-Wahl for the link.
I’m currently writing a post about a project I’ve been working on recently, which I’m hoping to have ready tomorrow or in the next few days. Meanwhile, however, I’d like to ask if there’s anything that readers would like to see me cover.
» Filed Under Uncategorized
Piracy for preservation
Posted on July 29, 2010
Recent talk on Arcan-L about the value of photographs divorced from their original collections reminded me of this article by Frank Cifaldi about preservation of video games – and why he believes that piracy is the best way to ensure that unreleased games can be preserved for the future. He regularly tracks down prototypes of games which were never released, migrates the data into modern formats, and releases them to the Internet for free download.
Cifaldi is coming out of the wild west of the games industry, of course. The industry is young, very young. Many of the companies who produced the games he releases, which are primarily Nintendo Entertainment System games from the 1980s, no longer exist; even those that do rarely have archives going back this far. Given the time-bomb nature of the 80s and 90s vintage media these games were stored on, he certainly has a legitimate argument that some form of active intervention is necessary.
Given the Arcan-L discussions on the value of archival items such as individual photographs outside their original context, I thought this issue was an interesting counterpart even if games have more built-in context and content than a single photograph.
» Filed Under digital preservation, Digitization
Searchable, web-downloadable PDF books from a DIY scanner
Posted on July 29, 2010
I recently put together a script to assemble web-size PDFs from some of the books that I scanned using my DIY scanner. We had had many requests from users for downloadable Adobe PDF editions of our books, but file size proved to be a significant problem when assembling them with Adobe’s Acrobat software. My script uses a set of open-source utilities to create multi-layered PDFs – a low-resolution, web-appropriate colour illustration layer above a high-resolution clear text layer.
Here are a couple of examples from the collection:
http://images.ourontario.ca/brant/75215/data – Herons and Cobblestones, by the Grand River Heritage Mines Society
http://images.ourontario.ca/brant/75674/data – Oakland Township: Two Hundred Years, by Stuart Rammage
I’ve received permission from my employer to release the script as GPL, and I’ll be doing that sometime in the next while. Before I release it, I hope to enhance it with a few additions, potentially including open-source OCR integration. (The current version delivers a non-OCRed PDF which must be processed using Acrobat.)
» Filed Under Digitization, Open-source, Software


