Tuesday, June 11, 2013

Intelligent Document Capture: The first step towards managing Big Data (Part 2)

Part Two 
In the first post, we attempted to contextualize the problem of Big Data, and in this second part, we’ll take on the job of explaining why document capture is the first step toward a solution. 

Go to part one


The first step: Make the information immersed in our digital content accessible
Computers are machines. They can’t “understand” or give context to the content of our digital assets. Unless the content of a digital asset, or the data that describe its content and nature (metadata), is added to something like an index table in our applications, machines cannot find relationships between data or put in context what we humans would like to understand from our digital content. Nowadays, we debate whether relational databases will still be able to handle these index tables in the near future, when we haven’t even taken the first step of indexing the existing content.


So, we’ve taken on the work of digitizing everything that we don’t have in digital format, and we’ve come up against the same problem. If we’re not able to let the machine access the real information that our digital assets contain, we’re just throwing them into a bottomless pit. We’re forgetting that a scanned document is nothing more than an image, which our human brains can read, but the same isn’t true of the processors of our machines.

We can use OCR software to take care of part of this problem: it lets us add the content of our documents and digital assets to the index table of our applications. But if specific data need to be shared or retrieved for use in specific applications, OCR alone leaves us with a huge amount of work searching for those data within the vast content of our digital assets. Why not make these data easier to access by treating them as metadata? For example, if our accounting software needs to know the number of each invoice, why make the software look for that number inside every invoice, every time it’s required? Wouldn’t it be easier to find it once and store it as metadata of the digital asset?
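The invoice example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the invoice number format, the regular expression, and the names `capture_metadata`, `ingest` and `metadata_index` are all hypothetical, and a plain dictionary stands in for the index table of a real application.

```python
import re

# Hypothetical pattern: assumes invoice numbers appear as "Invoice No. INV-2013-0042".
INVOICE_NUMBER = re.compile(
    r"Invoice\s+(?:No\.?|Number|#)\s*:?\s*([A-Z0-9][A-Z0-9-]*)",
    re.IGNORECASE,
)

# A plain dict stands in for the application's index table (assumption).
metadata_index = {}


def capture_metadata(ocr_text):
    """Scan the OCR text once and return the metadata to store with the asset."""
    match = INVOICE_NUMBER.search(ocr_text)
    return {"invoice_number": match.group(1) if match else None}


def ingest(doc_id, ocr_text):
    """Run at capture time: extract the data once and keep it as metadata."""
    metadata_index[doc_id] = capture_metadata(ocr_text)


def invoice_number(doc_id):
    """Later lookups read the stored metadata; the OCR text is not scanned again."""
    return metadata_index[doc_id]["invoice_number"]


ingest("doc-1", "ACME Ltd.\nInvoice No. INV-2013-0042\nTotal: 150.00 EUR")
print(invoice_number("doc-1"))
```

The point of the design is that the expensive search happens once, at capture time; every later request from the accounting software becomes a simple lookup.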

Right, so we’ve fixed the problem of fast access to specific data contained within our digital assets, but there’s still one more problem to solve. With the quantity of documents that we receive every day, is it viable to spend time searching through each of these documents for data? No; if we’re talking about Big Data, it certainly isn’t. But if we’ve already managed to get the machine to read the content of our scanned documents, why not also get it to extract the data that we need to have most accessible?

That, folks, is the key: to get the machine to work for us, and that’s the first place where we should invest our money. When we know how to get the most out of the data in our digital content, we’ll discover that Big Data becomes more of a hardware problem, because we can provide the software with the entry point for the information that it needs. Document capture software is just the beginning, but without it there’s nowhere to go when trying to deal with Big Data.




DOWNLOADS: We explain how Athento helped Crisa manage technical documents.



Popular posts:
Comparing Document Capture Solutions (Athento, Kofax, Ephesoft etc.)
Document Management Success Case at BBVA, managing 7 million records.


Monday, June 10, 2013

Intelligent Document Capture: The first step towards managing Big Data

Part One
In this first post, we’ll go over the challenge that Big Data represents for businesses, in a conceptual manner; in the next post, we’ll explain how document capture could be the first step towards resolving this problem. 


What is Big Data?
So far, there is no clear agreement on the point at which we start talking about Big Data. Some speak of petabytes, others of exabytes; still others don’t need to reach capacities that large. The boundary between what is (or isn’t) a Big Data problem can be determined by the capacity of an organization’s information systems to manage the information in question.


How can we understand the problem of Big Data?
Gartner lays out Big Data as a three-dimensional problem:

  • Enormous volumes of information: With the rise of new technologies, we’re not just creating more digital content; we’re also creating digital content that takes up more space. According to the “Digital Universe” report published by IDC, by 2015 we will reach 7,910 exabytes of information worldwide.
  • Digital content is becoming increasingly varied: These days, we use diverse gadgets and formats to create and store information: voice messages, text messages, videos, e-mails, social networks, bank transactions, etc. All of these different pieces of data can’t be treated in the same way.
  • These volumes, and the variety of information, need increasingly fast means of processing and retrieval. Many businesses have had to confront system collapses or, simply, users who balk at using information applications because they’re just too slow. Speed, however, also has to be seen from another point of view: how quickly we are now creating and storing information.

The biggest challenge in the era of Big Data
It’s not just about getting our applications to survive the size of the data; it’s also about making sure that our information systems don’t become black, bottomless pits where we toss our digital content every day. We need to be able to generate real information or knowledge, starting from our digital content.

Don’t miss the second part of our post.


Discover how Athento's intelligent document capture technologies work: download it




Wednesday, June 5, 2013

Improvements with Digitized Document Images

As you know, we here at Athento have been dedicating ourselves tirelessly to investigating document capture for some time. Our objective is to get many manual tasks, such as data extraction or document classification, to be done in a completely automatic way, with the highest precision possible.


For these tasks to be automated, especially the extraction of data, the images have to meet certain minimum quality criteria. Anyone who’s ever had to scan a document knows that, once scanned, the document can end up with defects like blurring, black (or white) edges, being off-center, etc.


When data is extracted from a document, one of the base technologies applied to it is OCR (Optical Character Recognition). Current OCR engines have problems reading the content of a document that has quality defects like noise. “Salt and pepper” noise, which is nothing more than a bunch of grainy spots spread throughout the image, negatively affects the performance of OCR.
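A common textbook technique for reducing salt-and-pepper noise is a median filter, which replaces each pixel with the median of its neighborhood so that isolated black or white specks disappear. The sketch below is a generic illustration of that technique, not a description of what Athento does internally; the function name `median_filter` and the list-of-rows image representation are assumptions made for the example.

```python
from statistics import median_low


def median_filter(image):
    """Return a copy of a grayscale image with a 3x3 median filter applied.

    `image` is a list of rows of pixel values (0-255). Pixels on the border
    use only the neighbors that exist.
    """
    height, width = len(image), len(image[0])
    cleaned = []
    for y in range(height):
        row = []
        for x in range(width):
            # Collect the pixel and its neighbors inside the image bounds.
            window = [
                image[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            # median_low always returns one of the original pixel values.
            row.append(median_low(window))
        cleaned.append(row)
    return cleaned


# A white page (255) with a single "pepper" speck (0) in the middle.
noisy = [
    [255, 255, 255],
    [255, 0, 255],
    [255, 255, 255],
]
cleaned = median_filter(noisy)
print(cleaned)
```

On real scans a library implementation would normally be used instead of pure Python loops; the example only shows why an isolated speck does not survive the filter.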


Below, you can see a digitized image which is grainy and contains a fair bit of noise:



For data extraction to be as precise as possible, noise has to be eliminated from the image. Francisco González, one of our engineers (affectionately known as “Kurro”), has made it possible for Athento to significantly “clean the noise” from digitized images.
Here’s the same image, after being improved and cleaned up by Athento:




Congratulations, Kurro: impressive work!




Tuesday, June 4, 2013

Infographic: Searches on Document Management Systems and ECM Platforms

A promise made is a debt unpaid. As promised, let's take a quick look at our awesome infographic.


How do users look for documents on DMS and ECM platforms?
Without a doubt, the most commonly used method to find documents is search forms. Even so, a significant number of users still find documents by browsing through folders and files.




What should you know about searches?
The first thing you should know about searches is that they cost you money. Every minute you spend looking for a document costs at least $0.09. The slower the system is, the more expensive it becomes. Another interesting fact about searches is that more than half the respondents to our survey said that they usually search for documents by words within their content. The bad news for those users is that not all ECM platforms and DMS support full-text indexing of content.
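To make the idea of full-text indexing concrete, here is a minimal sketch of an inverted index, the usual structure behind searching by words within content. It is an illustration only: the function names, the tokenization rule and the sample documents are assumptions, and real platforms rely on dedicated search engines.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase words (letters and digits only)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to the set of document ids whose content contains it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical sample content, as it might come out of OCR.
documents = {
    "contract.pdf": "Service contract between the company and the supplier",
    "invoice.pdf": "Invoice for the service delivered in May",
}
index = build_index(documents)
print(search(index, "service invoice"))
```

The index is built once, when documents are stored, so each search is a handful of set lookups instead of a scan through every document.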




If you would like to check out the whole infographic, you can download it for free from our website.



DOWNLOADS: Download this infographic and learn more about searches on DMS and ECM systems.







Thursday, May 30, 2013

Graphic Work Flows with Athento

Eng. Manuel Rueda
Having complex work flows (not those which are included by default in any DMS or ECM platform) is a question of coding. That means that their phases and tasks need to be coded into a work flow before the system can begin to work with them. That’s the reason why there are tools which allow work flows to be modeled in a more visual manner and which leave the boring work of coding behind; the application performs that work itself while users drag and drop the tasks, events, activities, etc., within their business processes.

The issue is that, after designing a work flow, it’s necessary to put it to work in a document management system or an ECM.

This week, our engineer Manuel Rueda presented us with a new functionality for Athento which allows work flows to be designed in the JBoss open source tool, Drools. Manuel told us how it’s possible to graphically design a work flow in Guvnor, so that it can be used in Athento.

Guvnor is a Business Rules Manager, which means that it’s the part of Drools which can help us with our BPM (Business Process Management) flows. Below, you can see a screen grab of a Guvnor screen. 



What you see on this screen is the visual editor of Guvnor flows and an example of a work flow modeled with this tool. This work flow is, in reality, an XML file, as you can see in the next screenshot:



Athento has the ability to interpret this XML file (with its start events, end events, gateways, activities, ad hoc subprocesses, tasks, service tasks, connecting objects and sequence flows) at run time, without needing to reboot the system for it to take in the new work flow, and to put it to work. This saves a lot of time for both developers and business analysts.
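As a rough idea of what interpreting such a file involves, the sketch below reads a small work flow definition and follows its sequence flows from the start event. It assumes the file uses BPMN 2.0 element names and the standard BPMN 2.0 model namespace; the sample process, the function names `load_workflow` and `walk`, and the simplification of always taking the first outgoing flow are all hypothetical and say nothing about how Athento implements this.

```python
import xml.etree.ElementTree as ET

# BPMN 2.0 model namespace (assumes the file is BPMN 2.0).
BPMN = "{http://www.omg.org/spec/BPMN/20100524/MODEL}"

# Hypothetical, simplified work flow for a grant request.
SAMPLE = """<definitions xmlns="http://www.omg.org/spec/BPMN/20100524/MODEL">
  <process id="grant_request">
    <startEvent id="start"/>
    <userTask id="review" name="Review request"/>
    <endEvent id="end"/>
    <sequenceFlow id="f1" sourceRef="start" targetRef="review"/>
    <sequenceFlow id="f2" sourceRef="review" targetRef="end"/>
  </process>
</definitions>
"""


def load_workflow(xml_text):
    """Return (nodes, transitions) for the first process in the file."""
    root = ET.fromstring(xml_text)
    process = root.find(BPMN + "process")
    nodes = {}        # node id -> element type, e.g. "userTask"
    transitions = {}  # source node id -> list of target node ids
    for element in process:
        tag = element.tag.replace(BPMN, "")
        if tag == "sequenceFlow":
            source, target = element.get("sourceRef"), element.get("targetRef")
            transitions.setdefault(source, []).append(target)
        else:
            nodes[element.get("id")] = tag
    return nodes, transitions


def walk(nodes, transitions):
    """Follow the first outgoing flow of each node, starting at the start event."""
    current = next(n for n, tag in nodes.items() if tag == "startEvent")
    path = [current]
    # Stop at a node with no outgoing flow, or before revisiting a node.
    while current in transitions and transitions[current][0] not in path:
        current = transitions[current][0]
        path.append(current)
    return path


nodes, transitions = load_workflow(SAMPLE)
print(walk(nodes, transitions))
```

Because the definition is plain data loaded on demand, a new file can be read while the application is running, which is the property the paragraph above describes.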

Once the file with the work flow is uploaded, system users can start to use it. In the following screen grab, you can see a request within a process for requests for grants. The flow that the document is going through has been created by uploading an XML file which was generated in Drools.





Wednesday, May 29, 2013

Version Control in Document Management: Why is it important?

The version of a document or digital content can be defined as a variation of a digital asset or its metadata. In other words, it is an update, edit or change with respect to a previous version and its metadata.

Edits, updates and changes are common in our daily work, as is having to undo those changes or, simply, choosing one version among several documents which have the same purpose (and which came from a single original). When there is no version control, those changes are permanent. Editing a document means overwriting content in a file. Once that content is overwritten, there is no way to go back and recover the version that existed before the changes were made.


There is a workaround to help us avoid this problem: saving a new file for each variation we make of the original. The problem with this strategy is that it becomes hard to identify the differences between the versions. We would have to keep checking the date and time of the last update of each file to know which one is the latest. That kind of work is considerably slow.


Version control means always being able to access the latest version, while still having access to previous versions. We can (almost immediately) retrieve a version of the document which is not the current version. The version history will also give us clear clues about what’s been done to a document and will be our road map for working with the different versions. On the other hand, without version control, work flows become an ineffectual tool, and teamwork on documents becomes more confusing and difficult to coordinate. In the case of work flows, you would have to eliminate every chance of human error (which is highly improbable), given that the flow could only be carried out one way, without any way of going back when there are problems in any one of the stages. With respect to working in teams, we would always be obliged to explain to our co-workers which version is the latest, and which version they should work on.
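The behavior described above can be sketched with a small class that keeps every saved version instead of overwriting the content. It is a minimal illustration, not the design of any particular DMS; the class name `VersionedDocument` and its methods are hypothetical, and a real system would also record who saved each version and when.

```python
class VersionedDocument:
    """Keeps every saved version of a document instead of overwriting it."""

    def __init__(self, content):
        self._versions = [content]  # version 1 is the original

    def save(self, content):
        """Store a new version and return its 1-based version number."""
        self._versions.append(content)
        return len(self._versions)

    @property
    def latest(self):
        """The current version is always the last one saved."""
        return self._versions[-1]

    def get(self, number):
        """Return a specific version by its 1-based number."""
        if not 1 <= number <= len(self._versions):
            raise IndexError(f"no version {number}")
        return self._versions[number - 1]

    def revert(self, number):
        """Make an old version current again by saving it as a new version."""
        return self.save(self.get(number))


doc = VersionedDocument("Draft")
doc.save("Draft with edits")
doc.revert(1)  # undo the edit; the edited version stays in the history
print(doc.latest)
```

Reverting by saving a copy, rather than deleting later versions, keeps the whole history intact, so the step that was undone can itself be recovered.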

To sum up, a functionality as simple as versioning can mean the difference between working and working twice as much to get the same results.

