WO2002013055A3 - Automatic categorization of documents based on textual content - Google Patents

Automatic categorization of documents based on textual content

Info

Publication number
WO2002013055A3
Authority
WO
WIPO (PCT)
Prior art keywords
document
category
textual content
documents
documents based
Application number
PCT/US2001/041669
Other languages
French (fr)
Other versions
WO2002013055A2 (en)
Inventor
Frank Smadja
Original Assignee
Elron Software Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2000-08-09
Filing date
2001-08-09
Publication date
2003-09-18
Application filed by Elron Software Inc
Priority to AU2001285432A
Publication of WO2002013055A2
Publication of WO2002013055A3

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10 TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10S TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00 Data processing: database and file management or data structures
    • Y10S707/99931 Database or file accessing
    • Y10S707/99933 Query processing, i.e. searching
    • Y10S707/99936 Pattern matching access

Abstract

An electronic device automatically classifies documents based upon textual content. Documents may be classified into document categories. Statistical characteristics are gathered for each document category, and these statistical characteristics are used as a frame of reference in determining how to classify the document. The document categories may be intersecting or non-intersecting. A neutral category is used to represent documents that do not fit into any of the other specified categories. The statistical characteristics for an input document are compared with those for the document category and for the neutral category in making a determination on how to categorize the document. This approach is extensible, generalizable and efficient.
PCT/US2001/041669 2000-08-09 2001-08-09 Automatic categorization of documents based on textual content WO2002013055A2 (en)
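
The comparison described in the abstract can be illustrated with a minimal sketch. The abstract does not say which statistical characteristics are gathered or how they are compared, so everything below is an assumption made for illustration: term counts stand in for the statistics, add-one smoothed log-likelihood stands in for the comparison, and the names term_counts, log_likelihood, categorize and the margin parameter are hypothetical. It is not the method claimed in the patent.

```python
import math
from collections import Counter


def term_counts(texts):
    """Count lowercase, whitespace-separated terms across a list of texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts


def log_likelihood(doc_terms, counts, vocab_size):
    """Add-one smoothed log-probability of the document's terms under one profile."""
    total = sum(counts.values())
    return sum(
        math.log((counts[term] + 1) / (total + vocab_size))
        for term in doc_terms
    )


def categorize(document, category_texts, neutral_texts, margin=1.0):
    """Return every category that fits the document better than the neutral profile.

    Each category is tested on its own against the neutral category, so a
    document can match several categories (intersecting categories) or none,
    in which case ["neutral"] is returned. `margin` is an assumed tuning
    parameter: the log-likelihood lead a category needs over neutral.
    """
    neutral = term_counts(neutral_texts)
    profiles = {name: term_counts(texts) for name, texts in category_texts.items()}
    doc_terms = document.lower().split()

    # Shared vocabulary so every profile is smoothed over the same term set.
    vocab = set(neutral) | set(doc_terms)
    for counts in profiles.values():
        vocab.update(counts)

    neutral_score = log_likelihood(doc_terms, neutral, len(vocab))
    matched = [
        name
        for name, counts in profiles.items()
        if log_likelihood(doc_terms, counts, len(vocab)) - neutral_score > margin
    ]
    return matched or ["neutral"]
```

Testing each category separately against the neutral profile, rather than picking the single best category, is what lets this sketch return more than one category for a document, which corresponds to the intersecting case mentioned in the abstract.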

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2001285432A AU2001285432A1 (en) 2000-08-09 2001-08-09 Automatic categorization of documents based on textual content

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US09/635,714 US6621930B1 (en) 2000-08-09 2000-08-09 Automatic categorization of documents based on textual content
US09/635,714 2000-08-09

Publications (2)

Publication Number Publication Date
WO2002013055A2 WO2002013055A2 (en) 2002-02-14
WO2002013055A3 WO2002013055A3 (en) 2003-09-18

Family

ID=24548813

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2001/041669 WO2002013055A2 (en) 2000-08-09 2001-08-09 Automatic categorization of documents based on textual content

Country Status (3)

Country Link
US (1) US6621930B1 (en)
AU (1) AU2001285432A1 (en)
WO (1) WO2002013055A2 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106294506B (en) * 2015-06-10 2020-04-24 华中师范大学 Domain-adaptive viewpoint data classification method and device

Families Citing this family (52)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2404337A1 (en) * 2000-03-27 2001-10-04 Documentum, Inc. Method and apparatus for generating metadata for a document
US20070027672A1 (en) * 2000-07-31 2007-02-01 Michel Decary Computer method and apparatus for extracting data from web pages
US6618717B1 (en) * 2000-07-31 2003-09-09 Eliyon Technologies Corporation Computer method and apparatus for determining content owner of a website
US20020091671A1 (en) * 2000-11-23 2002-07-11 Andreas Prokoph Method and system for data retrieval in large collections of data
CN1240011C (en) * 2001-03-29 2006-02-01 国际商业机器公司 File classifying management system and method for operation system
US20030130993A1 (en) * 2001-08-08 2003-07-10 Quiver, Inc. Document categorization engine
JP3997774B2 (en) * 2001-12-11 2007-10-24 ソニー株式会社 Data processing system, data processing method, information processing apparatus, and computer program
US7024624B2 (en) 2002-01-07 2006-04-04 Kenneth James Hintz Lexicon-based new idea detector
US7409404B2 (en) * 2002-07-25 2008-08-05 International Business Machines Corporation Creating taxonomies and training data for document categorization
US7743061B2 (en) * 2002-11-12 2010-06-22 Proximate Technologies, Llc Document search method with interactively employed distance graphics display
US20040122660A1 (en) * 2002-12-20 2004-06-24 International Business Machines Corporation Creating taxonomies and training data in multiple languages
US20040162824A1 (en) * 2003-02-13 2004-08-19 Burns Roland John Method and apparatus for classifying a document with respect to reference corpus
US8266215B2 (en) 2003-02-20 2012-09-11 Sonicwall, Inc. Using distinguishing properties to classify messages
US7299261B1 (en) * 2003-02-20 2007-11-20 Mailfrontier, Inc. A Wholly Owned Subsidiary Of Sonicwall, Inc. Message classification using a summary
US7146361B2 (en) 2003-05-30 2006-12-05 International Business Machines Corporation System, method and computer program product for performing unstructured information management and automatic text analysis, including a search operator functioning as a Weighted AND (WAND)
US20040243554A1 (en) * 2003-05-30 2004-12-02 International Business Machines Corporation System, method and computer program product for performing unstructured information management and automatic text analysis
US7139752B2 (en) * 2003-05-30 2006-11-21 International Business Machines Corporation System, method and computer program product for performing unstructured information management and automatic text analysis, and providing multiple document views derived from different document tokenizations
US20040243556A1 (en) * 2003-05-30 2004-12-02 International Business Machines Corporation System, method and computer program product for performing unstructured information management and automatic text analysis, and including a document common analysis system (CAS)
US20040243560A1 (en) * 2003-05-30 2004-12-02 International Business Machines Corporation System, method and computer program product for performing unstructured information management and automatic text analysis, including an annotation inverted file system facilitating indexing and searching
WO2004114160A2 (en) * 2003-06-13 2004-12-29 Equifax, Inc. Systems and processes for automated criteria and attribute generation, searching, auditing and reporting of data
US7734627B1 (en) * 2003-06-17 2010-06-08 Google Inc. Document similarity detection
US20090100138A1 (en) * 2003-07-18 2009-04-16 Harris Scott C Spam filter
WO2005010727A2 (en) * 2003-07-23 2005-02-03 Praedea Solutions, Inc. Extracting data from semi-structured text documents
CA2536097A1 (en) * 2003-08-27 2005-03-10 Equifax, Inc. Application processing and decision systems and processes
US11132183B2 (en) 2003-08-27 2021-09-28 Equifax Inc. Software development platform for testing and modifying decision algorithms
US7245765B2 (en) * 2003-11-11 2007-07-17 Sri International Method and apparatus for capturing paper-based information on a mobile computing device
US8693043B2 (en) * 2003-12-19 2014-04-08 Kofax, Inc. Automatic document separation
US7975240B2 (en) * 2004-01-16 2011-07-05 Microsoft Corporation Systems and methods for controlling a visible results set
US7624274B1 (en) * 2004-02-11 2009-11-24 AOL LLC, a Delaware Limited Company Decreasing the fragility of duplicate document detecting algorithms
US7725475B1 (en) 2004-02-11 2010-05-25 Aol Inc. Simplifying lexicon creation in hybrid duplicate detection and inductive classifier systems
US7392262B1 (en) 2004-02-11 2008-06-24 Aol Llc Reliability of duplicate document detection algorithms
US7444380B1 (en) 2004-07-13 2008-10-28 Marc Diamond Method and system for dispensing and verification of permissions for delivery of electronic messages
US7496567B1 (en) 2004-10-01 2009-02-24 Terril John Steichen System and method for document categorization
US10803126B1 (en) * 2005-01-13 2020-10-13 Robert T. and Virginia T. Jenkins Method and/or system for sorting digital signal information
US7266562B2 (en) * 2005-02-14 2007-09-04 Levine Joel H System and method for automatically categorizing objects using an empirically based goodness of fit technique
US7593904B1 (en) * 2005-06-30 2009-09-22 Hewlett-Packard Development Company, L.P. Effecting action to address an issue associated with a category based on information that enables ranking of categories
US8719073B1 (en) 2005-08-25 2014-05-06 Hewlett-Packard Development Company, L.P. Producing a measure regarding cases associated with an issue after one or more events have occurred
US8423908B2 (en) * 2006-09-08 2013-04-16 Research In Motion Limited Method for identifying language of text in a handheld electronic device and a handheld electronic device incorporating the same
US7885466B2 (en) * 2006-09-19 2011-02-08 Xerox Corporation Bags of visual context-dependent words for generic visual categorization
CA2921562C (en) * 2007-08-07 2017-11-21 Equifax, Inc. Systems and methods for managing statistical expressions
US9082080B2 (en) * 2008-03-05 2015-07-14 Kofax, Inc. Systems and methods for organizing data sets
US20100121842A1 (en) * 2008-11-13 2010-05-13 Dennis Klinkott Method, apparatus and computer program product for presenting categorized search results
US20100121790A1 (en) * 2008-11-13 2010-05-13 Dennis Klinkott Method, apparatus and computer program product for categorizing web content
US8392175B2 (en) * 2010-02-01 2013-03-05 Stratify, Inc. Phrase-based document clustering with automatic phrase extraction
US8996350B1 (en) 2011-11-02 2015-03-31 Dub Software Group, Inc. System and method for automatic document management
US9298814B2 (en) 2013-03-15 2016-03-29 Maritz Holdings Inc. Systems and methods for classifying electronic documents
US11928606B2 (en) 2013-03-15 2024-03-12 TSG Technologies, LLC Systems and methods for classifying electronic documents
US9053392B2 (en) * 2013-08-28 2015-06-09 Adobe Systems Incorporated Generating a hierarchy of visual pattern classes
US9881079B2 (en) * 2014-12-24 2018-01-30 International Business Machines Corporation Quantification based classifier
WO2016172288A1 (en) * 2015-04-21 2016-10-27 Lexisnexis, A Division Of Reed Elsevier Inc. Systems and methods for generating concepts from a document corpus
US11783005B2 (en) 2019-04-26 2023-10-10 Bank Of America Corporation Classifying and mapping sentences using machine learning
US11429897B1 (en) 2019-04-26 2022-08-30 Bank Of America Corporation Identifying relationships between sentences using machine learning

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2000026795A1 (en) * 1998-10-30 2000-05-11 Justsystem Pittsburgh Research Center, Inc. Method for content-based filtering of messages by analyzing term characteristics within a message

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0786914B2 (en) * 1986-11-07 1995-09-20 株式会社日立製作所 Change detection method using images
US5479533A (en) * 1992-02-28 1995-12-26 Yamatake-Honeywell Co., Ltd. Pattern recognition apparatus and method using fuzzy logic
US5581630A (en) * 1992-12-21 1996-12-03 Texas Instruments Incorporated Personal identification
DE69331518T2 (en) * 1993-02-19 2002-09-12 IBM Neural network for comparing features of image patterns
US5978620A (en) * 1998-01-08 1999-11-02 Xerox Corporation Recognizing job separator pages in a document scanning device

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2000026795A1 (en) * 1998-10-30 2000-05-11 Justsystem Pittsburgh Research Center, Inc. Method for content-based filtering of messages by analyzing term characteristics within a message

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
HOCH R: "USING IR TECHNIQUES FOR TEXT CLASSIFICATION IN DOCUMENT ANALYSIS", SIGIR '94. DUBLIN, JULY 3 - 6, 1994, PROCEEDINGS OF THE ANNUAL INTERNATIONAL ACM-SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, BERLIN, SPRINGER, DE, vol. CONF. 17, 3 July 1994 (1994-07-03), pages 31 - 40, XP000475312 *
MAAREK Y S ET AL: "FULL TEXT INDEXING BASED ON LEXICAL RELATIONS AN APPLICATION: SOFTWARE LIBRARIES", PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL. (SIGIR). CAMBRIDGE, MA., JUNE 25 - 28, 1989, READING, ACM, US, vol. CONF. 12, 25 June 1989 (1989-06-25), pages 198 - 206, XP000239149 *
ROIGER R ET AL: "Selecting training instances for supervised classification", ITESM, XP010509261 *
SMADJA F: "RETRIEVING COLLOCATIONS FROM TEXT: XTRACT", COMPUTATIONAL LINGUISTICS, CAMBRIDGE, MA, US, vol. 19, no. 1, 1993, pages 143 - 177, XP000905567 *

Also Published As

Publication number Publication date
WO2002013055A2 (en) 2002-02-14
AU2001285432A1 (en) 2002-02-18
US6621930B1 (en) 2003-09-16

Similar Documents

Publication Publication Date Title
WO2002013055A3 (en) Automatic categorization of documents based on textual content
WO2002008933A3 (en) System and method for automated classification of text by time slicing
EP1503300A3 (en) Vision-based document segmentation
EP1528486A3 (en) Classification evaluation system, method, and program
WO2004061572A3 (en) Adaptive classification of network traffic
WO2002082321A3 (en) Method and system for archiving data files
EP2293204A3 (en) Methods and systems for transitioning between thumbnails and documents based upon thumbnail appearance
EP0763818A3 (en) Formant emphasis method and formant emphasis filter device
WO2006042265A8 (en) System and method for facilitating network connectivity based on user characteristics
EP1646121A3 (en) Overcurrent detection method and detection circuit
EP1691548A3 (en) Data slicer, data slicing method, and amplitude evaluation value setting method
EP0810535A3 (en) Document retrieval system
WO2005008439A3 (en) San/storage self-healing/capacity planning system and method
EP1469399A3 (en) Updated data write method using a journaling filesystem
EP1416358A3 (en) Apparatus and method for managing power in computer system
WO2004075093A3 (en) Music feature extraction using wavelet coefficient histograms
EP1069739A3 (en) Removal of a common mode voltage in a differential receiver
EP1156587A3 (en) Method and apparatus for detecting switch closures
EP1455299A3 (en) Device and method for binarizing image
EP1599058A3 (en) Spread communication system and mobile station thereof
WO1999035778A3 (en) Low level content filtering
EP0845864A3 (en) Level converter and semiconductor device
EP0898363A3 (en) Surface acoustic wave device
EP0933679A3 (en) Photographic processing apparatus and method
EP0881778A3 (en) Ternary signal input circuit

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 EP: the EPO has been informed by WIPO that EP was designated in this application
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 EP: PCT application non-entry in European phase
NENP Non-entry into the national phase

Ref country code: JP