WO1997015026A1

WO1997015026A1 - Processor based method for extracting tables from printed documents

Info

Publication number: WO1997015026A1
Application number: PCT/US1996/016800
Authority: WO
Inventors: Hassan Alam
Original assignee: Bcl Computers, Inc.
Priority date: 1995-10-20
Filing date: 1996-10-21
Publication date: 1997-04-24
Also published as: US5737442A; US5956422A

Abstract

A processor based method for recognizing, capturing and storing tabular data receives digital-computer data representing a document either as a pixel-format document-image (24), or as formatted text file (22). Within the digital computer, either form of the digital-computer data is processed to locate and organize tabular data present therein. After a table (102) has been located, tabular data is extracted from cells represented either as pixel-format document-image (24), or as formatted text file (22). The extracted tabular data is stored into a database (46) present on the digital computer.

Description

PROCESSOR BASED METHOD FOR EXTRACTING TABLES FROM PRINTED DOCUMENTS

Technical Field The present invention relates generally to document image processing using digital computers and, more particularly, to optical character recognition which also recognizes, captures and stores tabular data.

Background Art

Automatic processing of digital data representing an image of a printed document using a digital computer to recognize, capture and/or store information has, for many years, been a subject of active research and commercial products. Thus far, however, such image processing has focused on recognizing, capturing and/or storing texts and even formats present in printed documents. However, in addition to text, many printed dpcuments, particularly financial, scientific and technical documents, contain tabular data. Truly recognizing, capturing and/or storing the entire informational content of such documents necessarily requires capturing more than the format of such tabular data. Rather, truly recognizing, capturing and/or storing such a document's entire informational content requires automatically capturing tabular data in a format suitable for easy computer-based analysis. At present, fully reconstructing the informational content of tables from printed documents in a format suitable for computer-based analysis requires manually re- entering the data from a printed table in a format suitable as input to a database or spreadsheet computer program.

Disclosure of Invention

An object of the present invention is to provide a proces¬ sor-based method for recognizing, capturing and storing tabular data of a printed document. Another object of the present invention is to provide a processor based method for recognizing, capturing and storing tabular data presented to a digital computer as a pixel-format document-image. Another object of the present invention is to provide a processor based method for recognizing, capturing and storing tabular data presented to a digital computer as formatted text. Another object of the present invention is to provide a processor based method for recognizing, capturing and storing tabular data presented to a digital computer either as a pixel- format document-image or as formatted text.

Briefly, a processor based method for recognizing, capturing and storing tabular data in accordance with the present invention receives digital-computer data representing a printed document either as a pixel-format document-image, or as formatted text. The pixel-format document-image may then either be directly processed to locate tabular data present therein, or the pixel-format document-image may be processed by optical character recognition ("OCR") to obtain formatted text. If the pixel-format document-image is immediately processed after receipt to obtain formatted text prior to locating tabular data, or if formatted text was initially received, such formatted text is processed to locate tabular data. After locating tabular data either in a received pixel-format document-image or in the formatted text, tabular data is extracted directly from cells present in either form of digital-computer data, or the tabular data located in the pixel-format document-image may first be processed by OCR to obtain formatted text before extracting tabular data. If the tabular data is extracted directly from a pixel-format document- image, then the extracted pixel-format tabular data must be processed by OCR prior to storage into a database. Alternative¬ ly, tabular data extracted from formatted text may be stored directly in a database.

These and other features, objects and advantages will be understood or apparent to those of ordinary skill in the art from the following detailed description of the preferred embodiment as illustrated in the various drawing figures. Brief Description of Drawings

FIG. 1 is a flow diagram depicting the overall method for extracting tabular data from either pixel-format document-images, or from formatted text; FIGs. 2a - 2d are plan views depicting document pages that illustrate various location in which a table may occur within a page; and

FIG. 3 is a flow diagram depicting processing steps for locating tabular data in pixel-format document-images; FIG. 4 depicts a horizontal projection profile evaluated for a pixel-format document-image;

FIG. 5 is a flow diagram depicting processing steps for locating tabular data in formatted text;

FIG. 6 depicts a table having lines of text to which tokens have been assigned;

FIG. 7 is a flow diagram depicting processing steps for extracting tabular data from cells in formatted text;

FIG. 8 depicts plumb lines used in establishing cells in a table which permit in extracting tabular data; FIG. 9 depicts rectangular regions established in a pixel- format document-image using connected component analysis;

FIG. 10 depicts an overall flow diagram for a computer program implementing the method for extracting tabular data illustrated in FIG. 1; FIG. 11 is a flow diagram depicting in more detail a step in the flow diagram of FIG. 10 for recognizing words in a document with OCR and assigning coordinates to each word;

FIG. 12 depicts the relationship between FIGs. 12a and 12b, the combined FIGs. 12a and 12b forming a decisional flow diagram that depicts in more detail a step in the flow diagram of FIG. 10 for assembling a document into a page image;

FIG. 13 is a flow diagram depicting in more detail a step in the flow diagram of FIG. 10 that finds the number of columns across a page; FIG. 14 depicts the relationship between FIGs. 14a through 14c, the combined FIGs. 14a through 14c forming a decisional flow diagram that depicts in more detail a step in the flow diagram of FIG. 10 that finds tables in the columns; FIG. 15 depicts the relationship between FIGs. 15a and 15b, the combined FIGs. 15a and 15b forming a decisional flow diagram that depicts in more detail a step in the flow diagram of FIG.

10 for joining tables that are identified in a multi-column page image; and

FIG. 16 depicts the relationship between FIGs. 16a, 16b and 15c, the combined FIGs. 16a, 16b and 16c forming a flow diagram depicting in more detail a step in the flow diagram of FIG. io that splits columns and outputs tabular data.

Best Mode for Carrying Out the Invention

Referring now to FIG. 1, in accordance with the present invention tabular data may be extracted either from a received formatted text 22 or from a received pixel-format document-image 24. The formatted text 22 may, in principle, be in any format from which a printed document may be produced using digital computer technology. However, as explained in greater detail below, it may be necessary to process formatted text 22 with a text-format translation-program 26 to obtain a text file 28 having a standardized format before commencing table extraction.

Similarly, the pixel-format document-image 24 may, in principle, be any bit-mapped representation of a printed document obtained initially through sufficiently high-resolution scanning or facsimile of a printed document. If the printed document is received as the pixel-format document-image 24, it may be immediately processed through OCR 32a to obtain the text file 28. Alternatively, the pixel-format document-image 24 may be processed directly through image-based table-location 34, described in greater detail below, to obtain a document-table image 36 which includes only those portions of pixel-format document-image 24 that appear to contain tabular data. After image-based table-location 34, the document-table image 36 may be processed through OCR 32a to obtain the text file 28 for just the tabular portions of pixel-format document-image 24. Processing the pixel-format document-image 24 first through image-based table-location 34 and then through OCR 32a reduces a possibility that a table dispersing event in OCR 32a may vitiate location of tabular data. If printed document data is received as the formatted text 22, or if it is converted from a received pixel-format document-image 24 by OCR 32a into text file 28, then the text file 28 is processed through character-based table-location 38, also described in greater detail below, to obtain a document-table text-file 42. The document-table text-file 42 is processed directly through character based cell extraction 48, described in greater detail below to extract tabular data from cells inherently present in the document-table text-file 42. The tabular data extracted from the document-table text-file 42 is then stored into a database 46.

After image-based table-location 34, instead of processing the document-table image 36 through the OCR 32a, the document-table image 36 may be processed directly through image-based cell-extraction 44, described in greater detail below, to extract tabular data cells inherently present in the pixel-format document-image 24. After processing the document-table image 36 through the image-based cell-extraction 44, only those portions of the pixel-format document-image 24 constituting tabular data cells are processed through OCR 32b before the extracted tabular data is stored into the database 46. As used herein, the term "database" includes also spread¬ sheets. As is well recognized by those skilled in the art, digital computer data formatted for processing by a database program can be either directly accepted by, or readily re¬ formatted for processing by a spreadsheet computer program.

Translation of Formatted Text

The formatted text 22 may be received in various different formats, e.g. any one of various different word-processing- program file formats. Prior to performing table extraction on such digital-computer data representing a printed document, the received word-processing-program file is processed through the text-format translation-program 26 to translate the received text into a standard format, preferably ASCII. Presently, text- format-translation programs are commercially available for translating text formatted in virtually any format, e.g. a word- processing-program format, into ASCII. Accordingly, to simplify table extraction, in general, it is advisable to process received formatted text with a the text-format translation-program 26 prior to commencing extraction of tabular data.

OCR Processing of Pixel-Format noeim-ant-lmages

Table geometry provides significant semantic demarcation for tabular data. Available OCR computer programs attempt to preserve table geometry present in pixel-format document-images by inserting blank lines and space characters. Thus, OCR processing captures a pixel-format document-image's format, as well as its text. Although the OCR process may not be error free, formatted text produced by OCR provides a sound basis for preliminary analysis of tabular data. Accordingly, OCR may be employed for converting a pixel-format document-image into formatted text prior to the image-based table-location 34 or prior to image-based cell-extraction 44. Moreover, OCR 32b must always be performed on cells containing tabular data extracted by the image-based cell-extraction 44 before the extracted tabular data may be stored into the database 46. Locating Tabular Data in

Pixel-Format Document-images

In the pixel-format document-image 24 there can exist only three types of table formats:

1. bounded tables in which all table elements are com- pletely enclosed within lines;

2. partially bounded tables in which one or more elements are not completely enclosed within lines; and

3. unbounded tables which contain no lines.

Locating a table in the pixel-format document-image 24 involves differentiating any tables from other elements such as body text, headings, titles, bibliographies, lists, author listings, ab¬ stracts, line drawings, and bitmap graphics. FIGs. 2a through 2d depict possible locations for tables 102 among text columns 104 on a printed page 106. The text columns 104 may be embedded within a text column 104 on a multi-column page 106 as illustrat¬ ed in FIGs. 2a, 2b and 2d. A table 102 may span more than one text column 104 on a multi-column page 106 as illustrated in FIGs. 2c and 2d. And a table 102 may even be embedded in a text column 104 on a page 106 having a table 102 that spans two or more text columns 104 as illustrated in FIG. 2d. Furthermore, a page may contain only tables.

Referring now to the flow chart of FIG. 3, the first step in locating tabular data in a page 106 of the pixel-format document-image 24 is to determine a skew angle as illustrated in processing block 112. Skewed images present a problem for document analysis due to an assumption that text flows acrosε a page parallel to a horizontal axis. The method described in "Analysis of Textual Images Using the Hough Transform," S. Srihari and V. Govindaraju, Machine Vision and Applications, vol.

2, pp. 141-153, (1989), incorporated herein by reference, permits determining a skew angle for the pixel-format document-image 24.

The next step is determining an approximate upper and lower boundary for any table 102 which extends across the page 106 as illustrated in FIGs. 2c and 2d. Identifying the upper and lower boundaries of a table 102 present in the page 106 includes evaluating a horizontal projection profile for white-space on the page 106 as illustrated in processing block 114. As depicted in FIG. 4, a horizontal projection profile 115 transforms two-dimen- sional data into one-dimensional data. Moreover, using the previously determined skew angle, a horizontal projection profile 115 may be taken of the page 106 at any angle thereby avoiding the need to de-skew the entire page 106 before processing. In evaluating the projection profile in processing block 114 of FIG. 3, it is preferred to use a modification of a recursive X-Y Cut method described in "Hierarchical Representation of Optically Scanned Documents," G. Nagy and S. Seth, Proceedings of 7th International Conference on Pattern Recognition, pp. 347-349 (1984), that is incorporated herein by reference. The horizontal projection profile identifies significant vertically distributed white-space gaps. Vertically spaced white-space gaps appear as a series of zero, or near zero, values in a projection profile vector "P" that is evaluated along each horizontal scan line across the page 106. Since noise and horizontal lines may mask a series of zero values in the projection profile, a projection profile value, p., for each scan line is evaluated using: 1. a pixel count (n-) that equals the number of black pixels along the scan line;

2. a cross count c^ ) that equals the number of times the pixel value changes from black to white or from white to black along the scan line; and

3. an extent (β_j) that equals the number of pixels between the first and last black pixel along the scan line.

Using the pixel count, the cross count, and the extent, the projection profile value, p., is determined for each scan line as set forth below.

If a scan line contains pixels from a horizontal line, or a bar of noise along the left or right margin of the pixel-format document-image 24, then n₃ > c. and, therefore, c-,²/n_: will be smaller than n.. If a scan line contains "speckled noise", ascenders, or descenders, then generally e, > n_D, and, therefore, n-³/^e-_j will be smaller than n.. Furthermore, if p_: as evaluated above is below a threshold (t) , preferably 5, p- is set equal to zero.

To find a possible table 102, it is then necessary to identify upper and lower boundary white-space gaps, illustrated in processing block 116, in the horizontal projection profile data using statistical methods. In particular, significant white-space gaps in the vector of projection profile values, i.e.

are intervals:

1. throughout which the projection profile values, p. , are zero; and

2. which have a width that is k standard deviations above the mean width for all intervals having zero projec- tion profile values, where preferably k = 1.5.

Horizontal white-space gaps identified statistically in this way are then analyzed as possible upper or lower table boundaries by determining: 1. if a particular white-space gap belongs to the interi¬ or of a sequence of equally spaced white-space gaps; or

2. if a vertical projection profile of the apparent table 102 reveals consecutive vertical regions across the page 106 containing significant white-space gaps, and those vertical white-space gaps cross the particular horizontal white-space gap.

If either of these conditions exist for a particular horizontal white-space gap, then the horizontal white-space gap extends across the interior of a possible table 102, and cannot be part of an upper or lower boundary of a possible table 102. By eliminating those horizontal white-space gaps that satisfy the preceding criteria, the remaining horizontal white-space gaps constitute approximate upper and lower boundaries that may enclose a table 102 present on the page 106.

Regardless of whether the preceding process identifies a table 102, if the page 106 has not been processed for multiple columns as determined in decision block 118, it is necessary to identify upper and lower boundary white-space gaps along vertical scan lines of the page 106 as illustrated in processing block 122. This vertical projection profile merely counts the number of black pixels along each vertical scan line. A vector containing all the vertical projection profile values is then analyzed to determine whether the page contains more than one column.

Assuming that no page 106 of the pixel-format document-image

24 has more than 2 columns and also assuming that white-space between entries in the table 102 is significantly wider than white-space separating text columns 104, the presence of multiple text columns 104 is determined by analyzing a vector "Z" of vertical projection profile values that is evaluated as follows.

1. Determine a width, z_n, for each interval of vertical projection profile values that are less than 75% of the maximum vertical projection profile value through¬ out an interval of at least 5 consecutive vertical projection profile values; and 2. determine a width, z_n, for each interval of vertical projection profile values in which there does not exist 5 consecutive vertical projection profile values that are less than 75% of the maximum vertical projec- tion profile value.

The vector Z then equals

. ^{Z =}. ( ^zl > . < ^Z2 > *2» * * •' ^zn > %» ^zn* ) ^■

A column exists if

Z_j < 30, for 2 ≤ j ≤ n, and min (IT,) =^- > 0.9, for 1 < < n . ma fZ )

If multiple text columns 104 exist, decision block 124 causes processing of each such text column 104 using horizontal projec¬ tion profiles in an attempt to identify upper and lower bound¬ aries for additional possible tables 102. In attempting to identify approximate upper and lower boundaries of additional possible tables 102 in the text columns 104, each text column 104 is successively processed in the same way as described above for processing the entire page 106.

In this way, upper and lower boundaries for possible tables 102 that extend across the page 106 are first determined, and then upper and lower boundaries are determined for any possible tables 102 that extend across only an individual text column 104. A horizontal location on the page 106 is also established for each of the possible tables 102 identified in this way based upon whether the table 102 was identified in a horizontal projection profile for the entire page 106, in which case the table 102 is centered on the page 106, or was identified in a horizontal projection profile for one of the text columns 104, in which case the table 102 is either on the left-hand or right-hand side of the page 106. Only those portions of the pixel-format document-image 24 which appear to encompass a table 102 are stored into the document-table image 36 for further processing.

Locating Tabular Data in Formatted Text Two independent strategies are employed for locating tables

102 within a page 106 represented by data present in the text file 28. The first strategy for locating a table 102 is to scan the text of the page 106 looking for certain keywords that frequently appear in the headers of tables 102. A second strategy for locating a table 102 is based upon identifying an arrangement of text that is characteristic of components of a table 102, i.e. a grouping of blocks of text and white-space on a page 106 that occurs in tables 102. However, before attempting to locate a table 102 on the page 106, as illustrated in the processing block 132 in FIG. 5 that portion of data in the text file 28 representing the page 106 is scanned to determine:

1. coordinates for the leftmost and rightmost characters on the page 106;

2. coordinates for each character on the page 106;

3. coordinates for the beginning and end of each group of non-blank characters in each text line of the page 106; 4. coordinates for the beginning and end of each group of blank characters in each text line of the page 106; and

5. if the page 106 has one or multiple text columns 104.

The presence of 2 text columns on a page 106 represented by data of the text file 28 is established if 3 blank characters occur centered horizontally at the middle of the page 106 in 15 consecutive text lines anywhere on the page 106. If such a white-space area does not occur anywhere along the center line of the page 106, then the page 106 has only a single text column 104.

Next, as depicted in processing block 134 of FIG. 5, the text in the text file 28 representing the page 106 is searched for a text line that contains a keyword, and that is an upper boundary for a possible table 102. In locating a possible table 102 by keyword searching, if either the word "table" or the word "figure" is located above a possible table 102, the keyword must be preceded by at least 5 blank characters in the same line as the keyword. Moreover, the text line containing the keyword must be immediately preceded by a text line containing only blank characters.

Having thus identified a beginning for a table, as illus¬ trated in processing block 136 of FIG. 5, the table is then characterized as being located either on the left side of the page 106, on the right side of the page 106, or centered on the page 106. If the coordinate of the first character in the keyword is larger than one-half the width of the text on the page 106, then the possible table 102 is located on the right side of the page 106. If the possible table 102 is not located on the right side of the page 106, and if 3 blank characters occur centered horizontally at the middle of the page 106 in each o_f the 10 text lines immediately below the line containing the keyword, then the possible table 102 is located on the left si_de of the page 106. If the possible table 102 is located neither on the left side nor the right side of the page 106, then the possible table 102 is centered on the page 106.

Characterizing the possible table 102 as being located either on the left side of the page 106, centered on the page 106, or on the left side of the page 106 establishes horizontal boundaries for the possible table 102. If the possible table 102 is on the left side of the page 106, then the left boundary is the leftmost character on the page 106, and the right boundary is in the center of the page 106. If the possible table 102 is centered on the page 106, then the left boundary is the leftmost character on the page 106, and the right boundary is the rightmost character on the page 106. If the possible table 102 is on the right side of the page 106, then the left boundary is in the center of the page 106, and the right boundary is the rightmost character on the page 106.

Finding a lower boundary for a possible table 102 identified by keyword searching, as illustrated by processing block 138 in FIG. 5, requires table-format analysis of text lines immediately below the text line containing the keyword. Table-format analysis of text lines employs the concept of an "entity" which, for the purpose of locating a lower boundary of the possible table 102, is defined to be a set of non-blank characters in a text line that is bounded:

1. on the left and the right either by 2 or more space characters; or

2. the left and/or right table boundary. Employing the preceding definition of an entity, table-format analysis classifies each text line in a possible page 106 into one. of three categories:

B a text line containing only blanks S a text line containing a single entity

C a text line containing multiple entities Each text line below the line containing the keyword is then classified in accordance with the preceding list, and the appropriate token is assigned to each line, i.e. the letter B, S or C, as depicted in FIG. 6. After all these text lines have been classified, a search is then performed to locate a lower boundary for the possible table 102. Starting at the text line immediately beneath the text line containing the keyword and continuing downward line-by-line, skip each line until encounter- ing a blank "B" text line. If any of the token patterns listed below occur in the 4 lines immediately below the blank "B" text line, then resume searching downward for the next blank "B" text line. If none of the token patterns listed below occur in the 4 lines immediately below the blank "B" text line, then the blank "B" line is the lower boundary of the possible table 102. Set forth below are token patterns which must occur in the 4 lines immediately below a blank "B" line if the table 102 continues beneath that blank line.

1 Line Pattern

C a text line containing multiple entities

2 Line Patterns

B-C a text line containing only blanks a text line containing multiple entities

S-C a text line containing a single entity a text line containing multiple entities

3 Line Patterns

B-B-C a text line containing only blanks a text line containing only blanks a text line containing multiple entities S-S-C a text line containing a single entity a text line containing a single entity a text line containing multiple entities 4 Line Patterns B-B-B-C a text line containing only blanks a text line containing only blanks a text line containing only blanks a text line containing multiple entities

S-B-S-C a text line containing a single entity a text line containing only blanks a text line containing a single entity a text line containing multiple entities S-S-S-C a text line containing a single entity a text line containing a single entity a text line containing a single entity a text line containing multiple entities Regardless of whether keyword searching locates a table 102, the text file 28 representing the page 106 is also processed to locate possible tables 102 based upon identifying an arrangement of text that is characteristic of components of a tabular format. In locating a table 102 possibly present within the page 106 represented by text in the text file 28 as illustrated by processing block 144 in FIG. 5, the smallest possible table 102 is to be identified. A first test for identifying the smallest possible table 102 is to find within the text file 28:

1. the same number of groups of non-blank characters in two text lines occurring within a group of 4 consecu¬ tive text lines; and 2. the number of groups of non-blank characters in both text lines exceeds 1. If the preceding conditions occur, then a possible table 102 has been located.

An alternative test for identifying the smallest possible table 102 is to find 2 immediately adjacent lines in a text column in which sufficiently wide white-space columns extend across the 2 immediately adjacent lines of text and between all groups of non-blank characters in both text lines. That is, for all the groups of non-blank characters in two immediately adjacent text lines, the end coordinate for one group of non- blank characters in one text line is subtracted from the beginning coordinate for the immediately successive group of non- blank characters in the immediately adjacent text line. beginning_cooιdinate( i +1 ) _j - ending_cooιdina te(i) ._±1

If every difference thus computed exceeds 2, then a possible table 102 has been located.

Having found a possible table 102, it is now necessary to determine its horizontal position on the page 106 as illustrated by processing block 146 in FIG 5. That is, it is now necessary to determine if the possible table 102 is on the left-hand side of the page 106, the right-hand side of the page 106, or crosses a vertical center-line of the page 106. A horizontal location of a possible table 102 is determined by comparing the starting coordinate for both the first and last group of non-blank charac¬ ters in two immediately adjacent text lines of the possible table 102. If beginning coordinates for both the first and last group of non-blank characters in the 2 immediately adjacent text lines lie either on the left-hand or on the right-hand side of the page 106, then the possible table 102 is located on that side of the page 106. However, if starting coordinates for both the first and last group of non-blank characters in 2 immediately adjacent text lines lie on opposite sides of the page 106, then the possible table 102 is centered across the page 106.

As before, characterizing the possible table 102 as being located either on the left side of the page 106, centered on the page 106, or on the left side of the page 106 establishes horizontal boundaries for the possible table 102. If the possible table 102 is on the left side of the page 106, then the left boundary is the leftmost character on the page 106, and the right boundary is in the center of the page 106. If the possible table 102 is centered on the page 106, then the left boundary is the leftmost character on the page 106, and the right boundary is the rightmost character on the page 106. If the possible table 102 is on the right side of the page 106, then the left boundary is in the center of the page 106, and the right boundary is the rightmost character on the page 106. Having determined a horizontal position for the possible table 102, approximate upper and lower boundaries must then be determined as illustrated by processing block 148 in FIG. 5. Upper and lower boundaries for the possible table 102 are determined using the same method as described previously for keyword searching of assigning a token to each text line both above and below the smallest possible table 102, and then searching first upward and then downward until a table-terminat¬ ing token sequence occurs in each direction. If the same area on the page 106 is identified as a possible table both by keyword searching and by table component analysis, then all redundant instances of the possible table 102 are eliminated. Accordingly, possible tables 102 are identified, and upper and lower, and left and right boundaries are determined for distinct such table 102.

Extracting Tabular Data from Cells Present in Formatted Text

Having obtained an upper and lower boundary and a left and right boundary on the page 106 for a possible table 102, ASCII text initially present in the text file 28 is processed in character based cell extraction 48 illustrated in FIG. 1 to extract tabular data. Table extraction from ASCII text is much easier than from the pixel-format document-image 24 or document-table image 36 for two reasons. First, transforming the pixel-format document-image 24 or document-table image 36 into the text file 28 reduces the coordinate system from pixel coordi¬ nates to text coordinates. Although transforming the pixel-format document-image 24 or the document-table image 36 into the text file 28 is a "lossy" transforation, i.e. font sizes and graphical lines are lost, isolating lines of text and white- space in ASCII text is faster and easier than using image processing techniques such as connected component analysis and/or projection profiles. Second, the words in a possible table 102 can be analyzed for content and semantic meaning, i.e. data types and keywords can be identified. The two preceding advantages compensate to some extent for geometrical errors that occur occasionally in transforming the pixel-format document-image 24 into the text file 28. In extracting the tabular data in character based cell extraction 48, an adaptation of techniques described in "Identi¬ fying and Understanding Tabular Material in Compound Documents," A. Laurentini and P. Viada, Proceedings of the International Conference on Pattern Recognition, pp. 405-409, (1992), that is incorporated herein by reference, is applied within the bounded area of the page 106 to divide the text column 104 into a Cartesian grid of basic cells, and to assign spreadsheet-like coordinates to each cell. Since the lines of ASCII text establish rows in the table, establishing boundaries for the cells requires only finding implied vertical separation lines between cells of the text column 104. Blank space intervals along the lines of ASCII text permit establishing "plumb lines" which extend vertically through the table. To establish plumb lines 154, illustrated in FIG. 8, for each text line in the table a white-space-interval vector

R_j = { (a_1# b ) , (a₂, b₂) , • • •, (a_n, b_π) }

is evaluated in processing block 152 illustrated in FIG. 7. In evaluating each white-space-interval vector R_D, a, is the coordi¬ nate of the left-hand end of a white-space interval, and b. is the coordinate of the right-hand end of a white-space interval.

The vectors R-,_, illustrated in FIG. 8, are then processed to establish white-space columns 156 that extend vertically across the entire table 102. The midpoint of each white-space column 156 is a potential plumb line 154. To locate the plumb lines 154, the intersection of all R- having more than 2 white-space intervals is formed to establish a white-space-vector intersection, "R," as illustrated in processing block 158 of FIG. 7.

R = R_ n R₂ n • • • n R_n The midpoint of each space interval in R represents a potential vertical plumb line 154 in the table 102.

Each R, that contains two white-space intervals can be either flow text, e.g. a header or a footer, or a text line which contains only a single table entry. For each R- having only 2 white-space intervals, the intersection of R. with R is formed to obtain an augmented white-space-vector intersection "R'." R' = R Pl R

If the number of white-space intervals in R¹ is much smaller than the number of white-space intervals in R, then text line i contains flow text and is not to be included in R. However, if the number of intervals in R' is only slightly smaller than, or even the same as, the number of intervals in R, then R is set equal to R', and line "j" is included in the body of the table 102 as illustrated in processing block 162.

The table plumb lines identified in this way together with the horizontal text lines divides the table into a Cartesian gri_d of basic cells to which are assigned spreadsheet like horizontal coordinates 166 and vertical coordinates 168. ASCII strings present in each of these cells, such as those set forth below that have been abstracted from FIG. 8, constitute the tabular data.

Cell[0] <NULL> Cell[0] <NULL> Cell[0] isotropic

Cell[l] <NULL> Cell[l] <NULL> Cell[l] consolidation stress

Cell[2] Void Ratio Cell[2] (e) Cell[2] p'o (kPa)

Cell[3] 0.809 Cell[3] <NULL> Cell[3] 384

Extracting Tabular Data from cells Present in Pixel-Format Document-Images

The basic strategy set forth above for extracting tabular data from cells present in the document-table text-file 42 using plumb lines 154 may also be applied to document-table image 36 to extract tabular data. In extracting tabular data from the document-table image 36, as illustrated in FIG. 9, rectangular regions 172 in the document-table image 36 that contain text are established using connected component analysis. Then vertical projection profiles of the rectangular regions 172 are evaluated to determine vertical plumb lines 154 extending through the white-space between the rectangular regions 172. The rectangular regions 172 and the plumb lines 154 determine cells of the document-table image 36 that contain tabular data. These cells in the document-table image 36 are then processed individually by the OCR 32b before storing the tabular data thus obtained into the database 46.

Storing Tabular Data into a Database

As illustrated above, the result either of the character based cell extraction 48 or of the image-based cell-extraction 44 and OCR 32b is a rectangular array of cells each one of which contains either a unique data value, or contains nothing, i.e. <NULL>. It is readily apparent to those skilled in the art that the data present in the cells illustrated above may be easily stored into a computer file in any one of a number of different formats that can be processed as input data by conventional database or spreadsheet computer programs. For example, the data in the cells may be stored into a file as tab-delimited text.

Computer program Flow Diagrams A computer program executed by a digital computer, that is illustrated by the overall flow diagram of FIG. 10, implements the present embodiment of the invention for extracting tabular data in the manner depicted in FIG. 1. A processing block 202 in FIG. 10, which corresponds to OCR 32a and image-based table-location 34 depicted in FIG. 1, uses a standardized document processing Application Program Interface computer program TextBridge® (a registered trademark of the Xerox Corporation) to perform OCR on a scanned document, assigning coordinates to each word in the document. Word and coordinate data produced by the computer program TextBridge are stored into a file in XDOC format. The computer program TextBridge is available from Xerox Corporation, 9 Centennial Drive, Peabody, Massachusetts 01960. Using the word coordinate data thus extracted from the document, processing block 204 assembles a page image of the document being processed. The page image of the document is then further processed in processing block 206 to determine number of columns across the page. Further processing of the page image in processing block 208 finds the tables among the previously identified columns. The processing performed in processing blocks 204 through 208 correspond to the character-based table-location 38 depicted in FIG. 1. After finding tables among the previously identified columns, if a document has a multi-column page image, pairs of tables are joined together into single larger tables in processing block 212, if joining tables is proper. Finally, in processing block 214 the columns are split and the extracted tabular data output. The processing performed in processing blocks 212 and 2₁₄ correspond to the character based cell extraction 48 depicted in FIG. 1.

The flow diagram depicted in FIG. 11 illustrates in greater detail processing block 202 depicted in the flow diagram of FIG. 10. The process for converting a document using OCR and assigning coordinates to words begins in processing block 222 with an initialization of the TextBridge computer program. After the TextBridge computer program has been initialized, the scanned document image may be cleaned-up in processing block 224 to remove noise, to de-skew the document image, and perform character repair. The document image is then processed by the

TextBridge computer program in processing block 226 to assign coordinates to each word on the page, and processing block 228 saves the results from TextBridge processing into a file in XDOC format. The processing block 204 in FIG. 10, illustrated in greater detail by the combined FIGs. 12a and 12b, depicts processing the XDOC file data to generate a page image by analyzing various commands generated during TextBridge processing that have been stored into the XDOC format file. In processing the XDOC format data, the computer program generates two page images, one image which stores each literal item of alphanumeric data present in the XDOC data, and another image which stores an image made up of codes that specify alphanumeric data characteristics. Assembly of these page images begins in processing block 232 with opening of the file previously saved in processing block 228 of FIG. 11. After the file has been opened, various data values needed for subsequent processing including globals and variables are initialized in processing block 234. A processing block 236 then reads a record from the XDOC output file.

The XDOC file record read in processing block 236 is first examined in decision block 238 to determine if the record specifies a type font. If the record specifies a type font, then processing block 242 adds that type font to a list of fonts, and records if this is the smallest font size encountered thus far in processing the XDOC output. After recording the font information, the computer program in decision block 244 on FIG. 12b determines if all of the XDOC file has been processed. If the end of the XDOC file has not been reached yet, then the computer program returns to processing block 236 to read and process the next record in the XDOC file.

If a record from the XDOC file does not specify a type font, the record is then examined in decision block 252 to determine if the record specifies a new line on the page image. If the record specifies a new line, the computer program in processing block 254 sets-up to process a new line by appropriately initializing line processing variables. After setting-up to process a new line, the computer program proceeds to decision block 244 to again determine if all of the XDOC file has been processed. If the end of the XDOC file has not been reached yet, then the computer program once again returns to processing block 236 to read and process the next record in the XDOC file. If a record from the XDOC file does not specify a font or a new line, the record is then examined in decision block 262 to determine if the record specifies a blank area on the page. If the record specifies a blank area, the computer program in processing block 264 adds the blank area to the page image using the smallest font size, and updates the line length. After recording the information for a blank area, the computer program proceeds to decision block 244 to again determine if all of the XDOC file has been processed. If the end of the XDOC file has not been reached yet, then the computer program once again returns to processing block 236 to read and process the next record in the XDOC file.

If a record from the XDOC file does not specify a font, a new line or a blank area, the record is then examined in decision block 272 to determine if the record specifies alphanumeric data. If the record specifies alphanumeric data, the computer program in processing block 274 adds the alphanumeric data to the page image using the smallest font size, and updates the line length. After adding the alphanumeric data to the page image, the computer program in processing block 276 inserts the character •1^• to the coded page image. After adding •1* to the coded page image, the computer program in decision block 278 determines if the line length in the page images after adding the alphanumeric data and the characters '1' is less than the alphanumeric data's right margin. If the line length in the page images after adding the alphanumeric data and the characters '1' is not less than the alphanumeric data's right margin, then the computer program proceeds to decision block 244 to again determine if all of the XDOC file has been processed. If the line's length is less than the page's right margin, then in processing block 282 the computer program pads the page image's alphanumeric data on the r-ight with blanks, and pads the "1" in the coded page image on the right with the character *2. ' After padding both the alphanumeric data and the coded data page images, the computer program proceeds to decision block 244 to again determine if all of the XDOC file has been processed. If the end of the XDOC file has not been reached yet, then the computer program once again returns to processing block 236 to read and process the next record in the XDOC file.

If the record read from the XDOC file does not specify a font, a new line, a blank, or alphanumeric data, then in processing block 286 data in the XDOC file record is ignored and the computer program proceeds directly to decision block 244 to again determine if all of the XDOC file has been processed. If the end of the XDOC file has not been reached yet, then the computer program once again returns to processing block 236 to read and process the next record in the XDOC file.

After processing the last record in the XDOC file, the computer program proceeds to processing block 292 in which it saves the page image into a file.

The flow diagram depicted in FIG. 13 illustrates in greater detail processing block 206 depicted in the flow diagram of FIG. 10 that finds the number of columns across a page. The first operation in finding the number of columns across a page is to determine, in processing block 302, a vertical projection profile from the page image developed in processing block 204. In processing block 304, the computer program then determines the average height of the vertical projection profile. The vertical projection profile is then analyzed in decision block 306 to determine if there exist any immediately adjacent set of vertical projection profiles that are: 1. wider than a value specified by a "Column Gap" global variable; and

2. have a height less than one-quarter of the average vertical projection profile height determined in processing block 304. The Column Gap global variable preferably has a value equal to two (2) character widths. If decision block 306 identifies any gaps, processing block 308 establishes tentative columns based upon the identified gaps. The tentative columns established in processing block 308 are then compared in decision block 312 with each other to determine if their respective widths are within 90% of each other. If the respective tentative column widths are not within 90% of each other or if no column gaps were identified in decision block 306, then processing block 314 establishes only a single column for the page image. If, however, the tentative column widths are within 90% of each other, then the columns are accepted in processing block 316, and the computer program in processing block 318 saves both the number of columns and upper and lower boundaries for the column.

The processing block 208 in FIG. 10, illustrated in greater detail by the combined FIGs. 14a through 14c, depicts finding tables in the columns. As depicted in FIG. 14a, the computer program in finding tables in the columns first initializes a "Table Row" vector in processing block 322. Having initialized the Table Row vector, the computer program determines in decision block 324 if the end of the page image has been processed.

If the end of the page has not been reached, then the computer program in processing block 326 selects for processing the next line down the page image. Processing the next line down the page begins with a determination in processing block 328 of whether the next line in the page image is a line of text. If the selected line does not contain text, then decision block 332 determines whether the line is centered horizontally on the page image. If the selected line does not contain text, and if the line is not centered horizontally on the page image, then in processing block 334 the current line is ORed bitwise into the Table Row vector. After ORing the current line into the Table Row vector, the computer program searches the Table Row vector in decision block 336 to determine if there are any gaps in the Table Row vector that are narrower than the width specified by the Column Gap global variable. If a gap narrower than the Column Gap does not exist in the Table Row vector then the computer program returns to processing block 326 to process the next line in the page image. If the current line contains text or is centered on the page, or if there exists a gap in the Table Row vector that is narrower than that specified by the Column Gap global variable, then in decision block 342 the computer program determines if more than two lines have been ORed into the Table Row vector. If more than two rows have been ORed into the Table Row vector, then the computer program in processing block 344 cleans up noise in the table. The table identified in the preceding manner from which noise has been removed is then stored into a file in processing block 346. If 2 or fewer lines have been ORed into the Table Row vector, or if a table has been stored, the computer program then determines in decision block 348 if page image processing has reached the end of page image. If the end of the page image has not been reached, then the computer program returns to processing block 322 to again initialize the Table Row vector. If decision block 348 deter¬ mines that processing has reached the end of the page, then the computer program has found all the tables in the page image and exits processing block 208.

FIG. 14b is a flow diagram depicting how a line is tested in processing block 328 to determine if it is a text line.

Before determining to determine if a line is a text line, the computer program in processing block 352 first determines the page width, the line width, the line character count, the maximum line character count, and the number of columns. Then, in decision block 354 the computer program tests the line to determine if the line is empty. If the line is empty, then it is not a text line. If the line is not empty, the computer program in decision block 356 then tests the line to determine if there is exactly 1 character in the line. If there is exactly 1 character in the line, then it is a text line. If there is not exactly l character in the line, then the computer program in decision block 358 determines if the number of columns is greater than the minimum number of columns. If the number of columns in the line is greater than a minimum columns parameter, then the line is not a text line. The minimum columns parameter prefera¬ bly has a value of 2. If the number of columns is not greater than the minimum columns parameter, the computer program in decision block 326 then determines is the width of characters in the line is greater than .6 of the width of characters across the page. If the width of characters across the line is greater than .6 of the width of characters across the page, then the line is a text line. If the width of characters in the line is not greater than .6 of the width of characters across the page, then the computer program in decision block 364 determines:

1. if the line's character count is greater than the maximum line character count; and

2. if the line length is greater than .4 of the page width.

If a line satisfies the two preceding criteria, then it is a text line. If a line does not satisfy the two preceding criteria, it is not a text line.

FIB. 14c is a flow diagram depicting the table noise clean- up performed in processing block 344. Noise in a table is first cleaned-up in processing block 372 by removing all rows from the top of the table that contain no or only 1 character. Table noise is further reduced in processing block 374 by removing all rows from the bottom of the table whose right margin is less than the first column width. This eliminates artifacts introduced by the document scanning process and/or a small number of characters that sometime appear in the left hand column at the bottom of tables. The processing block 212 in FIG. 10, illustrated in greater detail by the combined FIGs. 15a and 15b, joins tables identified in processing block 208 that are part of a multi-column page image. Accordingly, the computer program in decision block 382 first determines if a page image having multiple columns is being processed. If a page image having only a single column is being processed, then the computer program immediately exits processing block 212. If tables are being found in a multi-column page image, then in processing block 384 the computer program begins processing the next column, starting with the first, looking for tables that may extend horizontally across multiple columns. The computer program then determines in decision block 386 if all columns have been processed in the page image. If all columns in the page image have been processed, then the computer program in processing block 388 saves all table coordinates into a file, and exits from processing block 212. If all columns in the page image have not been processed, then in processing block 392 the computer program gets the next table, beginning with the first, of those found in performing processing block 208. After getting the next table, the computer program in decision block 394 determines if all tables found while performing processing block 208 have been processed. If all tables have been processed, then the computer program returns to processing block 384 to get the next column. If all tables found by the computer program while performing processing block 208 have not been processed, then the computer program performs a search to determine if the present table overlaps vertically with any other tables found while performing processing block 208. Finding such an overlap between tables begins in processing block 396 in which the computer program gets the next table. Now having selected a pair of tables to compare for overlap, the computer program in decision block 398 deter¬ mines if all tables found in processing block 208 have, been processed. If all tables have not been processed, then the computer program in decision block 402 determines if any vertical overlap exists between the table selected in processing block 392 and the table selected in processing block 396. If no vertical overlap exists between the two tables, the computer program immediately returns to processing block 396 to get the next table. If after selecting a table in processing block 396 all of the tables have been processed, the computer program then passes through decision block 398 to decision block 404 which determines if any overlap has been found among the tables. If no overlap has been found among the tables, then the computer program returns from decision block 404 to processing block 384 to get the next column.

If in decision block 402 or in decision block 404 an overlap exists between a pair of tables, then the computer program in processing block 406 sets an Overlap Found variable and estab¬ lishes a new table boundary that encompasses both tables. After establishing the new table boundary, the computer program in processing block 412 moves to the immediately adjacent column to determine if the table extends horizontally into that column. Accordingly, in a decision block 414 the computer program determines if all of the columns have been processed. If all of the columns have been processed, then the computer program returns to processing block 384 to get the next column. If all of the columns have not been processed, then in processing block 416 the computer program gets the next table. In decision block 418 the computer program then determines if all of the tables have been processed. If all of the tables have been processed, then the computer program returns to processing block 384 to get the next column. If all of the tables have not been processed, then the computer program in decision block 422 determines if any overlap exists between the tables found by joining two tables and the table selected in processing block 416. If overlap exists between the two tables, then the computer program returns to processing block 406 to further enlarge the table boundary. If there exists no overlap between the two tables, the computer program returns to processing block 416 to process the next table.

The processing block 214 in FIG. 10, illustrated in greater detail by the combined FIGs. 16a through 16c, depicts splitting columns in the tables that have been found, and outputing tabular data. The flow diagram depicted in FIGs. 16a through 16c depicts processing a single table, and is therefore performed iteratively once for each table that has been identified.

Splitting of columns in the tables begins in processing block 432 with the creation of a row vector that is initially full of blanks, and that has a length equal to the character width of the table. All of the rows of the table are then ORed into the vector to establish a vector that indicates the location of plumb lines consisting of whitespace that extend downwar_d through the table. Then in processing block 434 the computer program finds all gaps in the table rows that are greater t_han the minimum column separation. In processing block 436, the computer program gets the next row in the table beginning with the top row. In processing block 438 the computer program gets the next plumb line beginning with the first. In processing block 442 the computer program replaces the space in the table row of the page image that is occupied by the plumb line together with any whitespace on either side of the plumb line with a tab character. Then in decision block 444 the computer program determines if all plumb lines have been processed. If all the plumb lines have not been processed, the computer program returns to processing block 438 to get the next plumb line. If all plumb lines have been processed, the computer program proceeds to decision block 446 which determines whether all rows have been processed. If all the rows have not been processed, then the computer program returns to processing block 436 to process the next row in the table. If all rows have been processed, then the computer program proceeds to processing block 452.

Upon entering processing block 452 the table being processed in the page image holds blocks of alphanumeric data that are separated by tabs, the tabs being located along the plumb lines identified in processing block 432. Accordingly, the computer program in processing block 452 therefore creates a table of alphanumeric data columns separating the columns by tabs. After forming this table, the computer program in processing block 454 gets the next column in the newly formed table beginning with the first column. In decision block 456 the computer program determines whether all columns in the table have been processed. If all columns in the table have been processed, then the computer program proceeds to processing block 458, depicted on FIG. 16c, which prints all the tables to an output file.

If all columns of alphanumeric data separated by tabs have not been processed upon reaching decision block 456, then the computer program in processing block 462 initializes a Row Vector of blanks that has a length equal to the current column's width. After initializing the Row Vector, the computer program in processing block 464 gets the next row in the column. In decision block 466 the computer program determines whether all rows in the column have been processed. If all rows in the column have not been processed, then the computer program proceeds to processing block 468 which ORs the row selected in processing block 464 into the Row Vector. The computer program then proceeds to decision block 472, depicted in FIG. 16b, which determines if any gaps exist in the Row Vector that are wider than the Column Gap global variable. If any gaps exist in the Row Vector that are larger than the Column Gap global variable, the computer program returns to processing block 464 to get the next row vector in the column. If a gap wider than the Column Gap global variable does not exist in the Row Vector, then the computer program returns to processing block 462 to re-initialize the Row Vector. An absence of any gaps wider than the Column Gap global variable in the Row Vector means that the alphanumeric data in the column processed thus far completely fills the column width, and that therefore the column as processed thus far cannot be split into two columns.

In decision block 466 depicted in FIG. 16a, upon reaching the bottom of the column the computer program goes to decision block 482 depicted in FIG. 16b. The computer program in decision block 482 then determines if the Row Vector upon reaching the bottom of the column possesses a columnar structure. If the Row Vector does not possess a columnar structure, then the computer program returns from decision block 482 to processing block 454, depicted in FIG. 16a, to process the next column in the table because this column need not be split into multiple columns. However, if the Row Vector possesses a columnar structure, then the column must be split into multiple columns to properly organize the tabular data. Accordingly, the computer program detects a columnar structure in decision block 482 and in processing block 484 the computer program begins processing the column in the top row of the gap which creates in the columnar structure detected in decision block 482. In processing block 486 the computer program gets the next row downward in the column beginning with the top row of the gap. In decision block 488 the computer program determines whether all the rows in the column below the top of the gap have been processed. If all the rows in the column below the top of the gap have not been processed, then in processing block 492 the computer program replaces the gap in the current row together with any whitespace on either side of the gap with a tab character. After replacing this row's gap and surrounding whitespace with a tab character, the computer program returns to processing block 486 to get the next lower row in the column.

When in decision block 488 the computer program determines that all rows below the top of the gap have been processed, then the computer program proceeds to processing block 502 which starts processing at the top row of the gap. In processing block 504 the computer program selects the row in the column immediate¬ ly above the present row. In decision block 506 the computer program determines whether all the rows in the column above the gap have been processed. If all of the rows above the gap have not been processed, then the computer program in processing block 508 inserts a tab character to the right of the alphanumeric data. After inserting a tab to the right of the alphanumeric data, the computer program returns to processing block 504 to get the row immediately above the present one. If in decision block 506 the computer program determines that all rows in the column located above the gap have been processed, then the computer program returns to processing block 454 in FIG. 16a to get and process the next column.

Industrial Applicability The computer program is written in the C programming language that is well-known to those skilled in the art. The program has been demonstrated on a IBM® PC compatible digital computer. It is readily apparent to those skilled in the art that various other programming languages and/or digital computers could be used for alternative, equivalent implementations of the invention.

Although the present invention has been described in terms of the presently preferred embodiment, it is to be understood that such disclosure is purely illustrative and is not to be interpreted as limiting. Consequently, without departing from the spirit and scope of the invention, various alterations, modifica¬ tions, and/or alternative applications of the invention will, no doubt, be suggested to those skilled in the art after having read the preceding disclosure. Accordingly, it is intended that the following claims be interpreted as encompassing all alterations, modifications, or alternative applications as fall within the true spirit and scope of the invention.

Claims

The ClaimsWhat is claimed is:

1. A processor-based method for recognizing, capturing and storing into a database file tabular data of a document, the document having at least one page, the method comprising the steps of: receiving into a digital computer digital-computer data representing a document; processing within the digital computer the digital-computer data to locate tabular data present therein; extracting the tabular data from cells present in the digital-computer data; and storing into a database file present on the digital computer the extracted tabular data.

2. The processor-based method of claim 1 wherein the digital-computer data is received as a pixel-format document-image.

3. The processor-based method of claim 2 wherein optical character recognition is performed directly on a received pixel-format document-image to obtain formatted text, and the formatted text obtained by optical character recognition is processed to locate tabular data present therein.

4. The processor-based method of claim 2 wherein the pixel-format document-image is processed directly to locate tabular data present therein.

5. The processor-based method of claim 4 wherein process¬ ing the pixel-format document-image directly to locate tabular data present therein includes the steps of: evaluating a horizontal projection profile of the pixel-format document-image; determining upper and lower boundaries of a table by analyzing white space disclosed by the horizontal projection profile; evaluating a vertical projection profile of the pixel-format document-image; and determining a horizontal location of the table by analyzing white space disclosed by the vertical projection profile.

6. The processor-based method of claim 5 wherein optical character recognition is performed to obtain formatted text of the pixel-format document-image for which upper and lower boundaries and a horizontal location have been determined, and the formatted text obtained by optical character recognition is further processed to locate tabular data present therein.

7. The processor-based method of claim 5 wherein tabular data located in a pixel-format document-image is extracted directly from cells present in the pixel-format document-image, and further including the step of: before storing the tabular data into the database file present on the digital computer, performing optical character recognition on pixel-format document-images extracted from the cells to obtain the tabular data.

8. The processor-based method of claim 7 wherein extract¬ ing the tabular data from cells present in the pixel-format document-image includes the steps of: establishing regions in the pixel-format document-image using connected component analysis; evaluating a vertical projection profile of the regions to determine plumb lines between regions; and using the regions and plumb lines, determining cells from which pixel-format document-images are extracted for optical character recognition processing.

9. The processor-based method of claim 1 wherein the digital-computer data is received as formatted text.

10. The processor-based method of claim l wherein the digital-computer data is formatted text present in a text file which is processed to locate tabular data present therein.

11. The processor-based method of claim 10 wherein processing the formatted text to locate tabular data present therein includes the steps of: scanning the formatted text to identify a line containing a keyword; determining that the line containing the keyword is a first horizontal boundary of a table; determining a horizontal location of the table on the page; and determining a second horizontal boundary of the table.

12. The processor-based method of claim 11 wherein processing the formatted text to locate tabular data present therein includes the steps of: locating a small table in the formatted text; determining a horizontal location of the table on the page; and determining upper and lower boundaries for the table.

13. The processor-based method of claim 10 wherein processing the formatted text to locate tabular data present therein includes the steps of: locating a small table in the formatted text; determining a horizontal location of the table on the page; and determining upper and lower boundaries for the table.

14. The processor-based method of claim 1 wherein the digital-computer data is formatted text in which tabular data has been located, and the tabular data is extracted from cells present in the formatted text.

15. The processor-based method of claim 14 wherein extracting tabular data located in formatted text from cells present in the formatted text includes the steps of: establishing white-space-interval vectors for all text lines in which tabular data has been located; and forming a white-space-vector intersection by intersecting the white-space-interval vectors for all white-space-interval vectors having more than two white-space intervals.

16. The processor-based method of claim 15 wherein extracting tabular data located in formatted text from cells present in the formatted text further includes the step of: forming an augmented white-space-vector intersection by intersecting with the white-space-vector intersection a white-space-interval vector having only two white-space inter¬ vals; and if the augmented white-space-vector intersection has no fewer white-space intervals than the white-space-vector intersec¬ tion, then replacing the white-space-vector intersection with the augmented white-space-vector intersection.