Genealogist Lisa Alzo uses a website called Transkribus to recognize text within images. The process has been around for decades: Optical Character Recognition, or OCR. I looked into Transkribus, but it isn’t free. So I searched for free OCR options we can all use.
It turns out a tool you may already be using has this capability. It’s OneNote!
I can think of two key reasons to use OCR in genealogy research:
- To pull text from images so you don't have to re-type it.
- To translate a large amount of text from another language.
Last June I wrote about a book that tells the history of one of my ancestral hometowns. (See "How to Use a Foreign-Language Book for Family Tree Research.") A distant cousin sent me the Italian-language book years ago. I began using Google Translate and saving the results in a Word document. It’s tedious work, though, because I have to type the Italian into Google Translate before it can generate the English translation.
You're probably already using two free tools that can do more for your family tree than you know. They can extract text from a genealogy document image.
Extract Text from a Photo and Translate
Using OneNote, you can:
- Photograph (or scan) the pages of the book.
- Drop the images into a OneNote file.
- Extract the text by right-clicking an image and choosing Copy Text from Picture. This copies the text to your clipboard.
- Paste the copied text either below the image or in a new section.
- Translate that text by choosing Translate > Translate Page.
The translated text appears in a new section of your OneNote document. It's ready for you to format and look over for any errors. It’s hard to find OCR software that will format your text nicely, so there's always a little work to do. OneNote keeps the line breaks from the original, so you have to do some editing to make it more readable.
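If you're comfortable with a little Python, the line-break cleanup can be partly automated. What follows is only a minimal sketch, not something OneNote provides. The function name `unwrap_lines` is my own invention, and it rests on two assumptions: that a blank line marks a new paragraph, and that a line ending in a hyphen is a word split across two lines. The second assumption will sometimes be wrong, so proofread the result.

```python
import re

def unwrap_lines(text: str) -> str:
    """Join hard-wrapped lines into paragraphs.

    Assumption: blank lines separate paragraphs.
    Assumption: a trailing hyphen means a word was split across lines.
    """
    paragraphs = re.split(r"\n\s*\n", text.strip())
    cleaned = []
    for para in paragraphs:
        joined = ""
        for line in para.splitlines():
            line = line.strip()
            if not line:
                continue
            if joined.endswith("-"):
                # Rejoin a word that was hyphenated at the line end.
                joined = joined[:-1] + line
            elif joined:
                joined += " " + line
            else:
                joined = line
        cleaned.append(joined)
    return "\n\n".join(cleaned)


if __name__ == "__main__":
    sample = "The town was foun-\nded in the\nyear 1500.\n\nA second\nparagraph."
    print(unwrap_lines(sample))
```

Paste the text you copied from OneNote in place of `sample`, run the script, and copy the output back into your document.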
The translation uses British English even though U.S. English is set as my preferred language. I'll have to change words like favour, colour, and analysed myself. And I have to look out for footnote numbers. You know how books use a small, raised number to point you to a footnote? Those numbers aren't extracted as superscripts, so they tend to blend into the text.
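Both of those chores can be eased with a short script. This is a rough sketch under stated assumptions: the word list `BRITISH_TO_US` is a hypothetical starter containing only the three words mentioned above, and the function names are my own. The spelling swap matches whole words only, so plurals like "colours" need their own entries. The footnote check only flags words with one to three digits glued to the end, such as "church12". It won't catch a number that follows punctuation.

```python
import re

# Hypothetical starter list; add the British spellings you actually run into.
BRITISH_TO_US = {
    "favour": "favor",
    "colour": "color",
    "analysed": "analyzed",
}

_SPELLING = re.compile(r"\b(" + "|".join(BRITISH_TO_US) + r")\b", re.IGNORECASE)


def americanize(text: str) -> str:
    """Replace the listed British spellings, keeping an initial capital."""
    def swap(match):
        word = match.group(0)
        replacement = BRITISH_TO_US[word.lower()]
        return replacement.capitalize() if word[0].isupper() else replacement
    return _SPELLING.sub(swap, text)


def find_stray_footnotes(text: str) -> list:
    """Return words ending in 1-3 digits, which may hide a footnote marker."""
    return re.findall(r"\b[^\W\d_]+\d{1,3}\b", text)


if __name__ == "__main__":
    print(americanize("Colour was analysed in his favour."))
    print(find_stray_footnotes("The church12 was rebuilt in 1742."))
```

The second function doesn't change anything. It just gives you a list of spots to check against the original page.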
I can imagine spending a day putting that book on my scanner and capturing two pages at a time in an image file. Then I can drop a bunch of images into OneNote, and extract and translate the text.
Turn Handwriting into Text
I did three tests with handwritten Italian documents. OneNote failed to extract the text from them. One of my tests was a 1942 death record with a fill-in-the-blanks format. OneNote extracted the typewritten parts of the form, and skipped over the handwriting!
Then I wrote a simple note in the nicest print I can manage. OneNote couldn't extract any text. If it could, that would be handy for capturing what's written on the back of a family photo.
Then I learned that Google Docs can extract text for you, too. The steps are as follows:
- Log into your free Google Drive account using a web browser or the app.
- Upload an image of the text you want to capture.
- Right-click that image and choose "Open with" > "Google Docs."
The Doc file will contain the image and its extracted text.
This is an easy way to turn handwriting into text. I tested it on the note I printed, and it worked perfectly. When I tested it on an old Italian death record, though, it didn't recognize anything. Still, it should be great for the backs of photos or old letters written by your ancestors.
I encourage you to give both OneNote and Google Docs a try.