WO2011132204A2 - Fetch engine - Google Patents

Fetch engine

Info

Publication number
WO2011132204A2
Authority
WO
WIPO (PCT)
Prior art keywords
snippets
server
websites
engine
information
Prior art date
Application number
PCT/IN2011/000263
Other languages
French (fr)
Other versions
WO2011132204A3 (en
Inventor
Ajay Sethi
Siddhartha Reddy
Dhayanithi Subramanian
Original Assignee
Ajay Sethi
Siddhartha Reddy
Dhayanithi Subramanian
Priority date
Filing date
Publication date
Application filed by Ajay Sethi, Siddhartha Reddy, Dhayanithi Subramanian filed Critical Ajay Sethi
Publication of WO2011132204A2 publication Critical patent/WO2011132204A2/en
Publication of WO2011132204A3 publication Critical patent/WO2011132204A3/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/957Browsing optimisation, e.g. caching or content distillation
    • G06F16/9577Optimising the visualization of content, e.g. distillation of HTML documents

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

This invention relates to collating information using communication means. Embodiments herein disclose a method for collating information from a plurality of entities on the World Wide Web (WWW), comprising a fetch engine server extracting a plurality of snippets of information from the plurality of websites; the fetch engine server transforming the plurality of snippets; and the fetch engine server merging the transformed plurality of snippets to construct the entity. Further disclosed herein is a server for collating information from a plurality of websites on the World Wide Web, the server comprising a crawling and extraction engine for extracting a plurality of snippets of information from the plurality of websites; a transformation engine for cleansing and transforming the plurality of snippets; and a data integration engine for merging the transformed plurality of snippets into an entity.

Description

Fetch Engine
FIELD OF INVENTION
[001] This invention relates to collating information using communication means.
BACKGROUND OF INVENTION
[002] The WWW is the largest repository of structured and unstructured information on the planet. The WWW, therefore, can be used to construct a repository of structured information for any category: local data (for example, information about hospitals and doctors in a city, or the best colleges in a country), music data (album metadata such as artist names), etc. So far, the process of data aggregation has relied heavily on manual processing of data.
[003] On the other hand, websites such as aggregation engines (for example, Google News) and search portals offer a listing of results (which may be search results or articles/news items from a variety of sources), with links to the sources of information. Some websites also offer snippets from the results. The snippets may be portions of the linked web page. For example, in a search portal, the snippets may be the portions of the linked web page where the keywords used for the search appear. In another example, in a news aggregation engine, the listing of the web pages may comprise the headline of the linked web page and the first few lines of the linked web page.
SUMMARY
[004] Embodiments herein disclose a method for collating information from a plurality of websites on the World Wide Web, comprising a fetch engine server extracting a plurality of snippets of information from the plurality of websites; the fetch engine server cleansing and transforming the plurality of snippets; and the fetch engine server merging the transformed plurality of snippets into an entity. The method further comprises the fetch engine server identifying a plurality of websites where the plurality of websites may contain required entities and snippets of information; and the fetch engine server crawling the identified plurality of websites to check if the identified plurality of websites contain the required snippets of information corresponding to the required entities and snippets of information. The fetch engine server extracting a plurality of snippets of information from the plurality of websites further comprises the fetch engine server reducing page layouts in the plurality of websites to semantic abstractions; and the fetch engine server dividing the plurality of websites into a plurality of logical segments. The fetch engine server transforming the plurality of snippets comprises the fetch engine server filtering out snippets of information which do not meet pre-specified criteria; and the fetch engine server standardizing values present in attributes of the snippets. The method further comprises assigning at least one fingerprint to each snippet; and marking entities as potentially similar if at least one fingerprint is common between the plurality of snippets. The fetch engine server may store the entity into a database. The fetch engine server may record metadata values associated with each of the plurality of entities. The fetch engine server may create links between a plurality of the entities.
[005] Further disclosed herein is a server for collating information from a plurality of websites on the World Wide Web, the server comprising a crawling and extraction engine for extracting a plurality of snippets of information from the plurality of websites; a transformation engine for cleansing and transforming the plurality of snippets; and a data integration engine for merging the transformed plurality of snippets into an entity. The crawling and extraction engine identifies a plurality of websites where the plurality of websites may contain required entities and snippets of information; and crawls the identified plurality of websites to check if the identified plurality of websites contains the required snippets of information corresponding to the required entities and snippets of information. The crawling and extraction engine also reduces page layouts in the plurality of websites to semantic abstractions; and divides the plurality of websites into a plurality of logical segments. The transformation engine filters out snippets of information which do not meet pre-specified criteria; and standardizes values present in attributes of the snippets. The server further comprises a de-duplication engine for assigning at least one fingerprint to each snippet; and marking entities as potentially similar if at least one fingerprint is common between the plurality of snippets. The server may store the merged snippet into a database. The server may record metadata values associated with each of the plurality of entities. The server may create links between a plurality of the entities.
[006] These and other aspects of the embodiments herein will be better appreciated and understood when considered in conjunction with the following description and the accompanying drawings. It should be understood, however, that the following descriptions, while indicating preferred embodiments and numerous specific details thereof, are given by way of illustration and not of limitation. Many changes and modifications may be made within the scope of the embodiments herein without departing from the spirit thereof, and the embodiments herein include all such modifications.
BRIEF DESCRIPTION OF FIGURES
[007] This invention is illustrated in the accompanying drawings, throughout which like reference letters indicate corresponding parts in the various figures. The embodiments herein will be better understood from the following description with reference to the drawings, in which:
[008] Fig. 1 depicts a system for aggregating data from multiple sources on the web, according to embodiments as disclosed herein;
[009] FIG. 2 depicts a fetch engine server, according to embodiments as disclosed herein;
[0010] FIGs. 2a, 2b and 2c are examples, according to embodiments as disclosed herein;
[0011] FIG. 3 depicts a crawling and extraction engine, according to embodiments as disclosed herein;
[0012] FIG. 4 depicts a transformation engine, according to embodiments as disclosed herein;
[0013] FIGs. 4a, 4b, 4c and 4d are examples, according to embodiments as disclosed herein;
[0014] FIG. 5 depicts a de-duplication engine, according to embodiments as disclosed herein;
[0015] FIG. 6 depicts a data integration engine, according to embodiments as disclosed herein;
[0016] FIG. 6a is an example, according to embodiments as disclosed herein;
[0017] FIG. 7 is a flowchart depicting the process as performed by the fetch engine server, according to embodiments as disclosed herein;
[0018] FIG. 8 is a flowchart depicting the process as performed by the crawling and extraction engine, according to embodiments as disclosed herein;
[0019] FIG. 9 depicts the process as performed by a transformation engine, according to embodiments as disclosed herein;
[0020] FIG. 10 depicts the process as performed by a de-duplication engine, according to embodiments as disclosed herein;
[0021] FIG. 11 depicts a search portal offering results to a user, according to embodiments as disclosed herein;
[0022] FIGs. 12a and 12b depict a flowchart illustrating the actions taken on receiving a search query, according to embodiments as disclosed herein; and
[0023] FIG. 13 depicts a flow chart illustrating a method of searching and storing a result, according to embodiments as disclosed herein.
DETAILED DESCRIPTION OF INVENTION
[0024] The embodiments herein and the various features and advantageous details thereof are explained more fully with reference to the non-limiting embodiments that are illustrated in the accompanying drawings and detailed in the following description. Descriptions of well-known components and processing techniques are omitted so as to not unnecessarily obscure the embodiments herein. The examples used herein are intended merely to facilitate an understanding of ways in which the embodiments herein may be practiced and to further enable those of skill in the art to practice the embodiments herein. Accordingly, the examples should not be construed as limiting the scope of the embodiments herein.
[0025] Embodiments herein describe a Fetch Engine system to build a repository of structured information for any category in an automated way. The system consists of a crawling and extraction system (which identifies and extracts relevant "raw" information from the Web), data transformation engine (which cleanses the raw data and standardizes it to a canonical form), data de-duplication engine (which determines whether entities obtained from different websites correspond to the same logical entity), and data integration engine (which eliminates duplicates, merges the entity attributes to construct the final record, etc.).
[0026] Various Fetch Engine components can be trained to understand the key characteristics of structured data belonging to different categories/verticals. For example, the data extraction system can be trained to split an address of a POI (Point-Of-Interest; or a "local data" entity) into house-number, street name, locality, city, zip-code, etc. As another example, the system can be trained to identify and understand album name, track name, artist name, etc. (corresponding to the music and multi-media metadata). Likewise, each component of the Fetch Engine server can be trained and fine-tuned to handle structured data for various categories.
[0027] Embodiments herein define a generic system to build an exhaustive repository of entities with the help of information available on the Web and/or other sources.
[0028] FIG. 1 depicts a system for aggregating data from multiple sources on the web, according to embodiments as disclosed herein. The system, as depicted, comprises a fetch engine server 103, the WWW 104, a plurality of websites 105 and a database 106. The fetch engine server 103 crawls the WWW 104 for websites 105 related to a specific topic, extracts snippets of relevant information from the web pages and composes an entity (which may be composed of snippets from multiple web pages). An entity is a record (comprising structured and unstructured information) corresponding to a concept (for example, a place/location in the real world, a song/video in digital format, etc.). Since information about a single concrete entity is typically mentioned by multiple websites, multiple snippets are identified and extracted for each entity. These snippets provide "raw" data, that is, possibly incorrect (old/obsolete/spurious/erroneous) information in a non-standard format, about the concept. The fetch engine server 103 functions independently of the structure of the websites 105 and does not have to change with changes in the layout of the websites 105. The fetch engine server 103 may store the information in the database 106 for future reference.
[0029] FIG. 2 depicts a fetch engine server, according to embodiments as disclosed herein. The fetch engine server 103 comprises a crawling and extraction engine 201, a transformation engine 202, a de-duplication engine 203 and a data integration engine 204. The crawling and extraction engine 201 crawls the WWW and identifies websites which contain specific information. In an embodiment herein, the crawling and extraction engine 201 may crawl any website available on the Web and/or specifically identified websites for the information. The crawling and extraction engine 201 extracts snippets from the websites. The snippets may be structured information (such as address, phone numbers, artist name, album name, etc.) or phrases, a sequence of words/phrases, or any set of characters which provide relevant information.
[0030] FIGs. 2a, 2b and 2c depict the process of extraction from a webpage. In FIGs. 2a and 2b, the details of the song including the name of the song, name of the film, name of the singers, name of the lyricist, music composer are extracted from the webpage. In FIG. 2c, the categories, name and address of an organization are extracted from the webpage.
[0031] The transformation engine 202 cleanses the extracted snippets. The transformation engine 202 filters the snippets that violate specified requirements/criteria. The transformation engine 202 further standardizes the snippets into standard phrases and standard values. The transformation engine 202 may also generate derived metadata. The transformation engine 202 may derive the derived metadata using the raw metadata associated with each snippet.
[0032] The de-duplication engine 203 identifies and removes the duplicates from the extracted snippets.
[0033] The data integration engine 204 assembles the entity from the extracted snippets. The data integration engine 204 may also be used for performing multi-record analysis and for deriving additional metadata from the entities.
[0034] FIG. 3 depicts a crawling and extraction engine, according to embodiments as disclosed herein. The crawling and extraction engine 201, as depicted, comprises a crawling engine 301 and an extraction engine 302. The crawling engine 301 crawls the WWW to identify websites which contain relevant information. The crawling engine 301 may use focused crawling, sitemap abstraction, exponential sampling or any other suitable mechanism to identify the websites which have to be checked for the information. The crawling engine 301 uses domain-specific features that take into account not only a website's own score but also the scores of its linked pages to determine the crawl priority. This ensures that useful websites and websites that beget the most useful pages are prioritized for crawling and extraction. The identified websites are passed on to the extraction engine 302. The extraction engine 302 uses approximate techniques to determine if an identified website may have an answer. If an identified website has the answer, the extraction engine 302 extracts the snippet from the website corresponding to the relevant information. The extraction engine 302 may identify the snippet by first reducing the various HTML/DOM expressions present in a webpage on the website into semantic abstractions. The extraction engine 302 then divides the webpage into logical segments and operates on high-information segments as identified by structural and measure-based clues. The extraction engine 302 further composes the snippets from multiple segments based on a plurality of factors comprising collocation, co-reference, etc.
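The crawl prioritization described above can be sketched as follows. This is only an illustrative sketch: the description does not state how a website's own score and the scores of its linked pages are combined, so the weighted sum, the `damping` parameter, the function names and the example hostnames are all assumptions.

```python
import heapq

def crawl_priority(url, own_score, links, damping=0.5):
    """Combine a page's own score with the mean score of the pages it links to.

    The weighted-sum formula and the damping factor are assumptions,
    not taken from the description.
    """
    linked = [own_score.get(target, 0.0) for target in links.get(url, [])]
    linked_mean = sum(linked) / len(linked) if linked else 0.0
    return own_score.get(url, 0.0) + damping * linked_mean

def crawl_order(own_score, links):
    """Return URLs from highest to lowest priority using a heap."""
    # heapq is a min-heap, so priorities are negated to pop the largest first.
    heap = [(-crawl_priority(u, own_score, links), u) for u in own_score]
    heapq.heapify(heap)
    order = []
    while heap:
        _, url = heapq.heappop(heap)
        order.append(url)
    return order

# Hypothetical scores and link graph.
own_score = {"a.example": 0.2, "b.example": 0.5, "c.example": 0.9}
links = {"a.example": ["c.example"], "b.example": [], "c.example": ["a.example"]}
order = crawl_order(own_score, links)
```

In this toy graph, a.example has a low score of its own but links to the high-scoring c.example, so it is crawled before b.example. This mirrors the stated aim of prioritizing websites that beget useful pages.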
[0035] FIG. 2c depicts the process of extraction from a webpage, where the categories, name and address of an organization are extracted from the webpage.
[0036] FIG. 4 depicts a transformation engine, according to embodiments as disclosed herein. The transformation engine 202 comprises a cleaning and standardization engine 401, a segmentation engine 402 and a fingerprinting engine 403. The cleaning and standardization engine 401 filters out websites/snippets that violate or do not satisfy pre-specified requirements. The requirements may be set by an administrator or an operator. The cleaning and standardization engine 401 standardizes all attribute values present in the snippets to a standardized format. For example, dates that appear in the snippets in various formats, such as 20/04/2010, 04/20/2010, April 20, 2010 and 20 April 2010, may all be standardized to 20 April, 2010. The cleaning and standardization engine 401 creates groups of similar attribute values based on various fuzzy matching techniques such as phonetic (soundex, etc.) similarity (with specialized support for Indian languages), edit-distance differences, stop word removal, support for handling abbreviations and common spelling variations, stemming (handling plurals and other variations by identifying the 'root' word), etc. The cleaning and standardization engine 401 performs the above processing separately for each attribute to ensure that thresholds can be controlled precisely.
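A minimal sketch of the date standardization and of one fuzzy comparison is given below. The list of input formats, the rule that day-first is tried before month-first, and the 0.85 threshold are assumptions. The similarity measure is Python's `difflib.SequenceMatcher` ratio, used here as a simple stand-in for the edit-distance and phonetic techniques named above, not as the technique of the embodiments.

```python
from datetime import datetime
from difflib import SequenceMatcher

# Candidate input formats, tried in order. Day-first is tried before
# month-first, so an ambiguous value such as 04/05/2010 is read as 4 May.
DATE_FORMATS = ("%d/%m/%Y", "%m/%d/%Y", "%B %d, %Y", "%d %B %Y")

def standardize_date(text):
    """Return the date as e.g. '20 April, 2010', or None if unparseable."""
    for fmt in DATE_FORMATS:
        try:
            parsed = datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
        # %B yields English month names under the default C locale.
        return f"{parsed.day} {parsed.strftime('%B')}, {parsed.year}"
    return None

def fuzzily_similar(a, b, threshold=0.85):
    """Case-insensitive similarity test; the threshold is an assumed value."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

dates = ["20/04/2010", "04/20/2010", "April 20, 2010", "20 April 2010"]
standardized = {standardize_date(d) for d in dates}
```

Because the threshold is a per-call parameter, each attribute can be compared with its own setting, in line with the per-attribute processing described above.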
[0037] The fingerprinting engine 403 generates fingerprints for each snippet. The fingerprinting engine 403 generates fingerprints based on fuzzy matching of all relevant attribute values of the snippet. For each pair of fuzzily similar attribute values, the fingerprinting engine 403 assigns a fingerprint to the snippet. The fingerprinting engine 403 may assign multiple fingerprints to each snippet. As an example, consider the popular Bollywood song "Aaj ki raat". The metadata corresponding to this song is available across a multitude of websites. Based on the extracted information, it is possible to generate the fingerprints shown below:
• Track: Aaj ki raat; Album: Ram Aur Shyam; Artists: Mohd Rafi; Music: Naushad; Year: 1967
• Track: Aaj ki raat mere dil; Album: Ram Aur Shyam; Artists: Mohd Rafi; Music: Naushad; Year: 1967
• Track: Aaj ki raat; Album: Dharam Putra; Artists: Mahendra Kapoor; Music: N Dutta
• Track: Aaj ki raat; Album: Anamika; Artists: Asha Bhosle
• Track: Aaj ki raat koi aane ko; Album: Anamika; Artist: Asha Bhosle; Music: R D Burman
• Track: Aaj ki raat koi aane ko; Album: Anamika; Artist: Lata Mangeshkar; Music: R D Burman
• Track: Aaj ki raat; Album: Anamika; Artists: Asha Bhosle
• Track: Aaj ki raat koi aane ko remix; Album: Asha Reveals Real RD; Artists: Asha Bhosle; Music: R D Burman
• Track: Aaj ki raat remix; Album: Kronos Caravan; Artists: Asha Bhosle; Music: David Harrington; Year: 2005
• Track: Aaj ki raat; Album: You have Stolen My Heart - Songs from R D Burman; Artists: Asha Bhosle; Music: R D Burman
• Track: Aaj ki raat; Album: Nai Umar ki Nai Fasal; Artists: Mohd Rafi; Music: Roshan
• Track: Aaj ki raat; Album: Vishwas; Artists: Lata Mangeshkar; Music: Kalyanji Anandji
• Track: Aaj ki raat khona hai kya; Album: Don; Artists: Alisha Chinoi, Sonu Nigam, etc.; Music: Shankar Ehsaan Loy; Year: 2007
• Track: Aaj ki raat; Album: Don; Artists: Alisha Chinoi, Sonu Nigam, Mahalaxmi Iyer; Lyrics: Javed Akhtar; Year: 2006
• Track: Aaj ki raat; Album: Don - The Chase Begins Again; Artists: Alisha Chinoi, Sonu Nigam; Year: 2006
• Track: Aaj ki raat; Album: Slumdog Millionaire; Artists: Alisha Chinoi, Sonu Nigam; Music: A R Rahman; Year: 2008
• Track: Aaj ki raat; Album: Slumdog Millionaire; Artists: Alisha Chinoi, Mahalakshmi Iyer, Sonu Nigam; Music: A R Rehman
• Track: Aaj ki raat; Album: Going to be Wild; Artists: Sonu Nigam, Sudesh Bhosle; Year: 2008
• Track: Aaj ki raat; Album: Coke ringtone; Artists: Imran Khan, Marcela Rodrigues; Year: 2010
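The description does not specify how a fingerprint is computed from the attribute values. The sketch below shows one possible scheme, under the assumption that a fingerprint is the attribute name plus the first three normalized tokens of the Track or Album value. The function names, the choice of attributes and the three-token prefix are all hypothetical.

```python
import re

def normalize(value):
    """Lower-case the value and keep only alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", value.lower())

def fingerprints(snippet, attributes=("Track", "Album"), prefix_tokens=3):
    """Return a set of fingerprints, one per fingerprinted attribute present.

    Using a token prefix makes 'Aaj ki raat' and 'Aaj ki raat koi aane ko'
    share a Track fingerprint (an assumed, deliberately loose scheme).
    """
    prints = set()
    for attr in attributes:
        value = snippet.get(attr)
        if value:
            key = " ".join(normalize(value)[:prefix_tokens])
            prints.add(f"{attr.lower()}:{key}")
    return prints

short = {"Track": "Aaj ki raat", "Album": "Don"}
long_title = {"Track": "Aaj ki raat khona hai kya", "Album": "Don"}
other_album = {"Track": "Aaj ki raat", "Album": "Anamika"}
shared = fingerprints(short) & fingerprints(other_album)
```

A loose scheme such as this over-groups on purpose: snippets from different albums still share the Track fingerprint, and the spurious groups are left to the later clustering step to separate.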
[0038] The classification engine 602 (which is part of the data integration engine 204) further assigns the entities to the correct categories. FIGs. 4a and 4b depict similar snippets from different websites which belong to different categories.
[0039] FIG. 5 depicts a de-duplication engine, according to embodiments as disclosed herein. The de-duplication engine 203 comprises an aggregation engine 501 and a clustering engine 502. The aggregation engine 501 groups all potential duplicate snippets into one group. The aggregation engine 501 may group all potential duplicate snippets into one group by identifying all potentially overlapping fingerprints and grouping all snippets that have at least one pair of overlapping fingerprints. The aggregation engine 501 merges multiple snippets into a single 'logical' entity (corresponding to a unique concept), called the master snippet for that group. For example, consider a fingerprint group with the following three snippets:
• R1: Track: Aaj ki raat khona hai kya; Album: Don; Artists: Alisha Chinoi, Sonu Nigam, etc.; Music: Shankar Ehsaan Loy; Year: 2007
• R2: Track: Aaj ki raat; Album: Don; Artists: Alisha Chinoi, Sonu Nigam, Mahalaxmi Iyer; Lyrics: Javed Akhtar; Year: 2006
• R3: Track: Aaj ki raat; Album: Don - The Chase Begins Again; Artists: Alisha Chinoi, Sonu Nigam; Year: 2006
[0040] If Track, Album, and Artists are the three most important attributes, the aggregation engine 501 will treat R1 as the 'base' snippet for constructing the master snippet for the group:
Track: Aaj ki raat; Album: Don; Artists: Alisha Chinoi, Sonu Nigam, Mahalaxmi Iyer, etc.; Music: Shankar Ehsaan Loy; Lyrics: Javed Akhtar; Year: 2006
[0041] The aggregation engine 501 may identify identical snippets extracted from multiple websites. The aggregation engine 501 assigns a score to each attribute value. Scores are based on the "support" and "confidence" associated with each attribute/partner. Depending on the attribute type, the relevant attribute values are selected. There are two types of attributes: additive and exclusive. Based on the attribute type, either a "winner" (the value with the highest score) or a union of values (those with above-threshold scores) is picked.
[0042] In the example above, assume Track, Album, Music, Lyrics, and Year to be 'exclusive' attributes (with one final value); while assume Artists to be an 'additive' group. Consider the master snippet described earlier:
Track: Aaj ki raat; Album: Don; Artists: Alisha Chinoi, Sonu Nigam, Mahalaxmi Iyer, etc.; Music: Shankar Ehsaan Loy; Lyrics: Javed Akhtar; Year: 2006.
[0043] Since there are no conflicts regarding "Music: Shankar Ehsaan Loy" and "Lyrics: Javed Akhtar", these attributes are added to the master snippet to enrich the metadata. Since "Artists" is an additive attribute, "Mahalaxmi Iyer" is added as an Artist. All other attributes are exclusive; in this case, "Track: Aaj ki raat", "Album: Don", and "Year: 2006" are computed to be the winners.
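The winner/union selection for the three "Don" snippets can be sketched as below. As a simplifying assumption, the score of a value is just its support (the number of snippets that mention it); the "confidence" component is not modelled, and the `min_support` threshold and function name are hypothetical.

```python
from collections import Counter

ADDITIVE = {"Artists"}  # every other attribute is treated as exclusive

def merge(snippets, min_support=1):
    """Build a master snippet: winner for exclusive, union for additive."""
    votes = {}
    for snippet in snippets:
        for attr, value in snippet.items():
            values = value if isinstance(value, list) else [value]
            votes.setdefault(attr, Counter()).update(values)
    master = {}
    for attr, counter in votes.items():
        if attr in ADDITIVE:
            # Union of all values whose support reaches the threshold.
            master[attr] = sorted(v for v, n in counter.items() if n >= min_support)
        else:
            # Single winner: the value with the highest support.
            master[attr] = counter.most_common(1)[0][0]
    return master

r1 = {"Track": "Aaj ki raat khona hai kya", "Album": "Don",
      "Artists": ["Alisha Chinoi", "Sonu Nigam"],
      "Music": "Shankar Ehsaan Loy", "Year": "2007"}
r2 = {"Track": "Aaj ki raat", "Album": "Don",
      "Artists": ["Alisha Chinoi", "Sonu Nigam", "Mahalaxmi Iyer"],
      "Lyrics": "Javed Akhtar", "Year": "2006"}
r3 = {"Track": "Aaj ki raat", "Album": "Don - The Chase Begins Again",
      "Artists": ["Alisha Chinoi", "Sonu Nigam"], "Year": "2006"}
master = merge([r1, r2, r3])
```

With support as the only score, the exclusive attributes resolve to "Aaj ki raat", "Don" and "2006", and the additive Artists attribute becomes the union of all three artist names, matching the master snippet given in the text.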
[0044] The clustering engine 502 generates clusters based on the initial fingerprint groups and identifies the snippets that correspond to a logical entity. It, therefore, helps to remove spurious groups created by the fingerprinting process. At the end of this step, the server generates and assigns a unique ID to each master snippet.
[0045] In another embodiment herein, the de-duplication engine 203 may also be based on business rules associated with the snippet and the website.
[0046] FIGs 4c and 4d depict exemplary snippets of the same information, which vary from webpage to webpage. As can be seen from the figures, the formats of the street and phone number vary.
[0047] FIG. 6 depicts a data integration engine, according to embodiments as disclosed herein. The data integration engine 204 comprises a quality scoring engine 601, a classification engine 602, a merging engine 603 and a field scoring engine 604. The merging engine 603 merges the available snippets into the entity. The merging may be done on the basis of the confidence associated with the attributes of each snippet. The merging engine 603 may pick the attribute with the highest confidence. The classification engine 602 creates 'links' between different entities (within and across categories) by identifying entities corresponding to the same or related topic within or across categories. For example, the platform identifies the ringtone, ringback tone, full track download, video, etc. corresponding to a specific song. 'Links' between entities, therefore, provide useful metadata for building user profiles, discovery, recommendation, analytics, etc. Further, the classification engine 602 supports statistical and analytical analysis for generating global metadata, for example, for ensuring the completeness and correctness of metadata across content providers.
[0048] FIG. 6a depicts an entity comprising two snippets which have been merged.
[0049] The classification engine 602 may also create 'links' between different snippets (within and across categories). For example, consider the following set of snippets from the earlier fingerprint group:
• Track: Aaj ki raat; Album: Anamika; Artists: Asha Bhosle
• Track: Aaj ki raat koi aane ko; Album: Anamika; Artist: Asha Bhosle; Music: R D Burman
• Track: Aaj ki raat; Album: Anamika; Artists: Asha Bhosle
• Track: Aaj ki raat koi aane ko remix; Album: Asha Reveals Real RD; Artists: Asha Bhosle; Music: R D Burman
• Track: Aaj ki raat remix; Album: Kronos Caravan; Artists: Asha Bhosle; Music: David Harrington; Year: 2005
• Track: Aaj ki raat; Album: You have Stolen My Heart - Songs from R D Burman; Artists: Asha Bhosle; Music: R D Burman
[0050] Due to the common "Track" and "Artists", the classification engine 602 identifies these songs to be strongly related to each other and may merge snippets 1 - 3 into a single 'logical' snippet. Snippets 4 - 6, however, are split into distinct groups (corresponding to different entities) due to distinct "Album" names. The classification engine 602, however, creates a 'link' between each of these songs and the single 'logical' entity (corresponding to snippets 1 - 3).
[0051] FIG. 7 is a flowchart depicting the process as performed by the fetch engine server, according to embodiments as disclosed herein. The crawling and extraction engine 201 crawls (701) the WWW to identify websites which contain specific information. In an embodiment herein, the crawling and extraction engine 201 may crawl specific identified websites on the WWW for the information. The crawling and extraction engine 201 extracts (702) snippets from the websites. The transformation engine 202 cleanses (703) the extracted snippets. The de-duplication engine 203 de-duplicates (704) the extracted snippets by identifying and removing the duplicates from the extracted snippets. The data integration engine 204 assembles (705) the entity from the extracted snippets. The various actions in method 700 may be performed in the order presented, in a different order or simultaneously. Further, in some embodiments, some actions listed in FIG. 7 may be omitted.
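The flow of FIG. 7 can be illustrated with a toy pipeline in which each stage is a deliberately trivial stand-in. Crawling (701) is omitted: the pages are assumed to be already fetched, in a hypothetical 'Key: value; Key: value' text form. None of the stage implementations below are taken from the embodiments.

```python
def extract(page):
    """Toy extractor: parse 'Key: value; Key: value' text into one snippet."""
    snippet = {}
    for part in page.split(";"):
        key, sep, value = part.partition(":")
        if sep:
            snippet[key.strip()] = value.strip()
    return [snippet]

def cleanse(snippet):
    """Drop snippets without a Track; collapse whitespace and title-case values."""
    if "Track" not in snippet:
        return None
    return {k: " ".join(v.split()).title() for k, v in snippet.items()}

def deduplicate(snippets):
    """Group snippets whose (Track, Album) pair is identical."""
    groups = {}
    for s in snippets:
        groups.setdefault((s.get("Track"), s.get("Album")), []).append(s)
    return list(groups.values())

def integrate(group):
    """Merge a group into one entity; later snippets only fill missing attributes."""
    entity = {}
    for s in group:
        for k, v in s.items():
            entity.setdefault(k, v)
    return entity

def run_pipeline(pages):
    snippets = [s for page in pages for s in extract(page)]          # step 702
    cleaned = [c for c in map(cleanse, snippets) if c is not None]   # step 703
    return [integrate(g) for g in deduplicate(cleaned)]              # steps 704-705

pages = [
    "Track: Aaj ki raat; Album: Don",
    "Track: aaj ki  raat ; Album: Don; Year: 2006",
    "Album: Anamika",
]
entities = run_pipeline(pages)
```

The first two pages collapse into a single entity after cleansing, and the third is filtered out because it lacks the required Track attribute.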
[0052] FIG. 8 is a flowchart depicting the process as performed by the crawling and extraction engine, according to embodiments as disclosed herein. The crawling engine 301 identifies (801) websites which contain relevant information and crawls (802) the identified set of websites. The crawling engine 301 may use focused crawling, sitemap abstraction, exponential sampling or any other suitable mechanism to identify the websites which have to be checked for the information. The crawling engine 301 uses domain-specific features that take into account not only a website's own score but also the scores of its linked pages to determine the crawl priority. This ensures that useful websites and websites that beget the most useful pages are prioritized for crawling and extraction. The identified websites are passed on to the extraction engine 302. The extraction engine 302 checks (803) if an identified website may have at least part of the required information. The extraction engine 302 first reduces (804) the various HTML/DOM expressions present in the website into semantic abstractions. The extraction engine 302 then divides (805) the webpage present in the website into logical segments and operates on high-information segments as identified by structural and measure-based clues. The extraction engine 302 further composes (806) the snippets from multiple segments based on a plurality of factors comprising collocation, co-reference, etc. The various actions in method 800 may be performed in the order presented, in a different order or simultaneously. Further, in some embodiments, some actions listed in FIG. 8 may be omitted.
[0053] FIG. 9 depicts the process as performed by a transformation engine, according to embodiments as disclosed herein. The cleaning and standardization engine 401 filters (901) out snippets/entities that violate or do not satisfy pre-specified requirements. The requirements may be set by an administrator or an operator. The cleaning and standardization engine 401 standardizes (902) all attribute values to a standardized format by creating groups of similar attribute values based on various fuzzy matching techniques such as phonetic (soundex, etc.) similarity (with specialized support for Indian languages), edit-distance differences, stop word removal, handling of abbreviations and common spelling variations, stemming (handling plurals and other variations by identifying the 'root' word), etc. The cleaning and standardization engine 401 performs the above processing separately for each attribute to ensure that thresholds can be controlled precisely. The various actions in method 900 may be performed in the order presented, in a different order or simultaneously. Further, in some embodiments, some actions listed in FIG. 9 may be omitted.
[0054] FIG. 10 depicts the process as performed by a de-duplication engine, according to embodiments as disclosed herein. The aggregation engine 501 groups (1001) all potential duplicate snippets into one group. The aggregation engine 501 may do so by identifying all potentially overlapping fingerprints and grouping all snippets that have at least one pair of overlapping fingerprints. The aggregation engine 501 merges (1002) the multiple snippets of a group into a single 'logical' snippet (corresponding to a unique snippet), called the master snippet for that group. The aggregation engine 501 identifies (1003) identical snippets provided by multiple content partners and merges (1004) the metadata values of the identified snippets. The clustering engine 502 generates and assigns (1005) a unique ID to each master snippet. The various actions in method 1000 may be performed in the order presented, in a different order or simultaneously. Further, in some embodiments, some actions listed in FIG. 10 may be omitted.
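The grouping step (1001) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the disclosed implementation: fingerprints are taken to be plain strings already assigned to each snippet, overlap is treated as transitive, and a union-find structure is used as one convenient way to collect the groups. The function name and input layout are hypothetical.

```python
from collections import defaultdict

def group_by_fingerprint(fingerprints):
    """Group snippet IDs that share at least one fingerprint.

    `fingerprints` maps a snippet ID to a set of fingerprint strings.
    Assumption: overlap is transitive, so if A overlaps B and B overlaps C,
    all three land in the same group.
    """
    parent = {sid: sid for sid in fingerprints}

    def find(x):
        # Walk up to the group representative, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_owner = {}  # fingerprint -> first snippet seen carrying it
    for sid, prints in fingerprints.items():
        for fp in prints:
            if fp in first_owner:
                parent[find(sid)] = find(first_owner[fp])
            else:
                first_owner[fp] = sid

    groups = defaultdict(list)
    for sid in fingerprints:
        groups[find(sid)].append(sid)
    return sorted(sorted(group) for group in groups.values())
```

Each returned group is a candidate for merging into one master snippet; how the fingerprints themselves are computed is outside the scope of this sketch.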
[0055] In an embodiment herein, while creating the entity, the fetch engine server 103 keeps track of which metadata values were contributed by which source snippets. In an embodiment herein, the fetch engine server 103 keeps track of the flow of metadata across the entire process. This enables the fetch engine server 103 to trace every attribute value back to the source of the metadata, which further helps in computing source-level confidence and other analytics metadata.
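One simple way to picture this bookkeeping is a merge that records, for every attribute value, the sources that supplied it. The sketch below is illustrative only; the data layout, the function names and the source labels used in the usage note are assumptions made for the example.

```python
def merge_with_provenance(snippets):
    """Merge snippets into one entity, remembering who contributed each value.

    `snippets` is a list of (source_name, {attribute: value}) pairs.
    Returns {attribute: {value: [source_name, ...]}}.
    """
    entity = {}
    for source, attrs in snippets:
        for attr, value in attrs.items():
            entity.setdefault(attr, {}).setdefault(value, []).append(source)
    return entity

def sources_for(entity, attr, value):
    """Trace an attribute value back to the sources that supplied it."""
    return entity.get(attr, {}).get(value, [])
```

For instance, merging a snippet from a hypothetical `partner_a` with one from `partner_b` that both carry `Album: Anamika` lets `sources_for(entity, "Album", "Anamika")` return both partner names, which is the kind of information a source-level confidence score could be built on.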
[0056] In an embodiment herein, the fetch engine server 103 verifies that the metadata associated with various snippets is correct and complete. In an example, consider the sub-group:
• Track: Aaj ki raat; Album: Anamika; Music: R D Burman
• Track: Aaj ki raat koi aane ko; Album: Anamika; Artist: Asha Bhosle; Music: R D Burman
• Track: Aaj ki raat koi aane ko; Album: Anamika; Artist: Lata Mangeshkar; Music: R D Burman
• Track: Aaj ki raat; Album: Anamika; Artists: Asha Bhosle
[0057] Snippets 2 and 4 mention "Asha Bhosle" but snippet 3 mentions "Lata Mangeshkar" as the artist. In other words, two snippets "support" Asha and one snippet "supports" Lata. In order to pick the correct value with more confidence, the fetch engine server 103 may derive additional support from the Web. Likewise, support can be derived for the "Aaj ki raat" versus "Aaj ki raat koi aane ko" track names.
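The notion of "support" in this example can be sketched as a simple vote count. The code below is a minimal illustration, not the disclosed method: it assumes the "Artists" key of the fourth snippet is normalized to "Artist", treats each snippet as one vote, and models support derived from the Web as a hypothetical list of extra votes passed in by the caller.

```python
from collections import Counter

# The four snippets of the example, with "Artists" normalized to "Artist".
SNIPPETS = [
    {"Track": "Aaj ki raat", "Album": "Anamika", "Music": "R D Burman"},
    {"Track": "Aaj ki raat koi aane ko", "Album": "Anamika",
     "Artist": "Asha Bhosle", "Music": "R D Burman"},
    {"Track": "Aaj ki raat koi aane ko", "Album": "Anamika",
     "Artist": "Lata Mangeshkar", "Music": "R D Burman"},
    {"Track": "Aaj ki raat", "Album": "Anamika", "Artist": "Asha Bhosle"},
]

def support(snippets, attribute, extra_votes=()):
    """Count how many snippets (plus optional extra votes) back each value."""
    counts = Counter(s[attribute] for s in snippets if attribute in s)
    counts.update(extra_votes)
    return counts

def best_value(counts):
    """Return (value, share of support), or None when there is no clear winner."""
    ranked = counts.most_common(2)
    if not ranked or (len(ranked) == 2 and ranked[0][1] == ranked[1][1]):
        return None
    value, votes = ranked[0]
    return value, votes / sum(counts.values())
```

On these four snippets the artist count is two for Asha Bhosle against one for Lata Mangeshkar, while the two track names tie at two votes each, so `best_value` returns `None` for "Track". A tie of this kind is where additional support from the Web would be needed to decide.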
[0058] In another embodiment herein, the fetch engine server 103 tags snippets with metadata extracted from the Web (and not available across any content partner). In an example, consider the sub-group:
• Track: Aaj ki raat; Album: Anamika; Music: R D Burman
• Track: Aaj ki raat koi aane ko; Album: Anamika; Artist: Asha Bhosle; Music: R D Burman
• Track: Aaj ki raat koi aane ko; Album: Anamika; Artist: Lata Mangeshkar; Music: R D Burman
• Track: Aaj ki raat; Album: Anamika; Artists: Asha Bhosle
[0059] Note that none of these snippets mention the Year of the song; it is possible to extract "Year: 1973" from some other relevant websites available on the Web. In addition, the Web can be used to identify other metadata attributes as well. Examples include "Cast" (actors) and "Movie Director", attributes that are useful for song discovery and for recommendation purposes.
[0060] FIG. 11 depicts a search portal offering results to a user, according to embodiments as disclosed herein. The figure depicts a user 1101, a search portal 1102, a fetch engine server 103, the WWW 104, a plurality of websites 105 and a database 106. The user 1101 has access to the search portal 1102 using a suitable communication means. The communication means may be a computing device such as a computer, a cellular phone, a Personal Digital Assistant (PDA) or any device capable of communicating with a portal. The device may communicate with the portal using any suitable communication means such as Internet Protocol (IP) based communications. The device may also communicate with the portal using a mobile data communication means such as GPRS (General Packet Radio Service), EDGE (Enhanced Data rates for GSM Evolution) or Short Messaging Service (SMS). The device may also communicate with the portal using a short-range communication means such as Bluetooth, Infrared, Zigbee and so on. The user 1101, using a suitable communication means, sends the query to the search portal 1102.
[0061] The search portal 1102 offers the user 1101 a means to enter the search query. The search portal 1102 offers the user 1101 an interface to access the results of the search query. The results presented by the search portal 1102 may comprise media and text, media only or text only. The results presented by the search portal 1102 may comprise clickable links, which provide the user 1101 with more information on being clicked. The results presented by the search portal 1102 may also comprise a compacted view of the results, where the user 1101 will be able to view more information by hovering over the result.
[0062] The search portal 1102 sends the query to the fetch engine server 103. The fetch engine server 103 crawls the WWW 104 for relevant websites 105, extracts factual information from web pages present in the websites 105, composes answers for the search query from the user 1101, ranks the results according to relevance and other metrics and sends the results to the search portal 1102. The fetch engine server 103 functions independently of the structure of the websites 105 and does not have to change with changes in the layout of the websites 105. The fetch engine server 103 may also act on instructions received from an operator to extract the results from the websites 105. The fetch engine server 103 also stores the results in the database 106 for future reference. If a search query is repeated, the fetch engine server 103 may fetch the result from the database 106. If the fetch engine server 103 does not find any results for the search query or does not find satisfactory results, the search query may be posted to any suitable public access forum, where any person may answer the query. The fetch engine server 103 may also obtain feedback from the public users on the accuracy of the results.
[0063] The database 106 can be used to store information pertaining to searches performed. The information stored may be the search results. The information stored may also be the locations where the information was obtained. The database 106 may also be accessible to public users using a public forum or any other suitable access means.
[0064] On obtaining the results, the fetch engine server 103 sends the results to the user 1101 using suitable communication means. In an embodiment herein, the fetch engine server 103 may send the results to the user 1101 using the same communication means by which the user 1101 sent the search query to the search portal 1102. In another embodiment herein, the fetch engine server 103 may send the results to the user 1101 using a communication means different from the one which the user 1101 used to send the query to the search portal 1102. The user 1101 may also indicate the communication means which has to be used to send the results. The results may be presented to the user 1101 using a graphical interface, in order of ranking. The user 1101 may also sort the results according to the user's preference. The results may also be presented to the user 1101 in a text-only interface, a combination of text and graphics or any suitable format of presenting the results.
[0065] FIGs. 12a and 12b depict flowcharts illustrating the actions taken on receiving a search query, according to embodiments as disclosed herein. A user 1101 sends (1201) a query to the search portal 1102, which is forwarded to the fetch engine server 103. On receiving the search query, the fetch engine server 103 checks (1202) if the result is present within the database 106. If the result is present within the database 106, the fetch engine server 103 fetches (1203) the result from the database 106 and presents (1204) the result to the user 1101 through the search portal 1102. If the result is not present in the database 106, the fetch engine server 103 searches (1205) for relevant sources of information. If the fetch engine server 103 finds relevant sources of information, the fetch engine server 103 reduces (1207) the sources of information, which may be in the form of web pages, to semantic abstractions. The fetch engine server 103 divides (1208) the source into logical segments. The fetch engine server 103 classifies (1209) the segmented sources, depending on the expected results from each of the segmented sources. The fetch engine server 103 collates (1213) the segmented sources for the requested information. The fetch engine server 103 may crawl through the segmented sources depending on the classification of the segmented sources. The fetch engine server 103 then checks (1214) if the result is available in one source. If the result is available in one source, the fetch engine server 103 extracts (1217) the result from the source. If the result is available in more than one source, the fetch engine server 103 extracts (1215) the relevant snippets from the sources and stitches (1216) the snippets to form the result. The fetch engine server 103 ranks (1218) the results and presents (1219) the results to the user 1101. The results may be presented to the user 1101 in a ranked order. The user 1101 may also sort the results in the order that the user is interested in. The fetch engine server 103 also stores (1220) the results in the database 106 for future reference. If the search for relevant sources does not return satisfactory results, then the fetch engine server 103 posts the query, as received from the user 1101, on an open forum. If no results are obtained on the open forum, the fetch engine server 103 informs (1212) the user 1101 of the failure to obtain the results. If results are obtained on the open forum, the fetch engine server 103 continues analyzing the results from step (1214). The various actions in method 1200 may be performed in the order presented, in a different order or simultaneously. Further, in some embodiments, some actions listed in FIG. 12 may be omitted.
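The control flow of method 1200 can be condensed into the following sketch. It is illustrative only and leaves out the segmentation, classification and stitching steps: a plain dictionary stands in for the database 106, the web search and the open forum are represented by caller-supplied functions, and each result is assumed to be a dictionary with a numeric "score" field. All of these names are hypothetical.

```python
def answer_query(query, cache, search_web, ask_forum):
    """Return (results, origin) for a query, or (None, "failed").

    `cache` is a dict standing in for the database 106. `search_web` and
    `ask_forum` are callables returning a list of result dicts, or an empty
    list when nothing satisfactory is found.
    """
    if query in cache:                                 # steps 1202-1203
        return cache[query], "database"
    results, origin = search_web(query), "web"         # step 1205
    if not results:                                    # fall back to the open forum
        results, origin = ask_forum(query), "forum"
    if not results:
        return None, "failed"                          # step 1212
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)  # step 1218
    cache[query] = ranked                              # step 1220
    return ranked, origin
```

Calling the function twice with the same query and the same `cache` shows the effect of step 1220: the second call is answered from the cache without invoking `search_web` again.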
[0066] FIG. 13 depicts a flow chart illustrating a method of searching and storing a result, according to embodiments as disclosed herein. The fetch engine server 103 crawls (1301) the web looking for sources containing information. The source of information may be a web page. On coming across a new source, the fetch engine server 103 checks (1302) if the source has been analyzed. If the source has already been analyzed, the fetch engine server 103 checks (1303) if the source has changed since the last time this particular source was analyzed. The fetch engine server 103 may check this by comparing the present source against a sample of the page, which was taken when the page was last analyzed. If the source has been analyzed and there are no changes in the source since the last analysis, the fetch engine server 103 continues to crawl for new sources. If the source has not been analyzed or there have been changes in the source since the last analysis, then the fetch engine server 103 crawls (1304) the source and detects (1305) the structure of the source. The structure of the source may be detected on the basis of the information contained in the source by analyzing the fields present in the source, information present in the fields and so on. The structure of the source may include the source structure, layout elements, visual blocks, styling elements and so on. The fetch engine server 103 then extracts (1306) data from the source and stores (1307) the data in the database 106. The fetch engine server 103 may also stitch extracts of data from multiple sources, before storing the data in the database 106. The various actions in method 1300 may be performed in the order presented, in a different order or simultaneously. Further, in some embodiments, some actions listed in FIG. 13 may be omitted.
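The checks in steps 1302 and 1303 can be sketched as below. This is an illustrative sketch, not the disclosed mechanism: the description compares the source against a stored sample of the page, whereas this example substitutes a SHA-256 digest of the full page content as a compact stand-in for that sample. The function names and the `seen` dictionary are hypothetical.

```python
import hashlib

def page_digest(content):
    """Digest of the page text, used here in place of a stored page sample."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()

def needs_crawl(url, content, seen):
    """Return True if the source is new or has changed since the last analysis.

    `seen` maps a URL to the digest recorded when it was last analyzed and is
    updated whenever a crawl is required.
    """
    digest = page_digest(content)
    if seen.get(url) == digest:
        return False          # analyzed before and unchanged: keep crawling elsewhere
    seen[url] = digest        # new or changed: record it and crawl (step 1304)
    return True
```

A digest of the whole page flags any change at all, including cosmetic ones; comparing a sample of selected regions, as the description suggests, would be less sensitive to such noise.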
[0067] The embodiments disclosed herein can be implemented through at least one software program running on at least one hardware device and performing network management functions to control the network elements. The network elements shown in Figs. 1-6 and 11 include blocks which can be at least one of a hardware device, or a combination of hardware device and software module.
[0068] The foregoing description of the specific embodiments will so fully reveal the general nature of the embodiments herein that others can, by applying current knowledge, readily modify and/or adapt for various applications such specific embodiments without departing from the generic concept, and, therefore, such adaptations and modifications should and are intended to be comprehended within the meaning and range of equivalents of the disclosed embodiments. It is to be understood that the phraseology or terminology employed herein is for the purpose of description and not of limitation. Therefore, while the embodiments herein have been described in terms of preferred embodiments, those skilled in the art will recognize that the embodiments herein can be practiced with modification within the spirit and scope of the embodiments as described herein.

Claims

WE CLAIM :-
1. A method for collating information from a plurality of websites on the world wide web comprising of
A fetch engine server extracting a plurality of snippets of information from said plurality of websites;
Said fetch engine server cleansing and transforming said plurality of snippets; and
Said fetch engine server merging said transformed plurality of snippets into an entity.
2. The method, as claimed in claim 1, wherein said method further comprises of said fetch engine server identifying a plurality of websites where said plurality of websites may contain required entities and snippets of information; and
said fetch engine server crawling said identified plurality of websites to check if said identified plurality of websites contain said required snippets of information corresponding to the required entities and snippets of information.
3. The method, as claimed in claim 1, wherein fetch engine server extracting a plurality of snippets of information from said plurality of websites further comprises of
Said fetch engine server reducing page layouts in said plurality of websites to semantic abstractions; and
Said fetch engine server dividing said plurality of websites into a plurality of logical segments.
4. The method, as claimed in claim 1, wherein said fetch engine server transforming said plurality of snippets comprises of
Said fetch engine server filtering out snippets of information which do not meet pre-specified criteria; and
Said fetch engine server standardizes values present in attributes of said snippets.
5. The method, as claimed in claim 1, wherein said method further comprises of
Assigning at least one fingerprint to each snippet;
Marking entities as potentially-similar if at least one fingerprint is common between said plurality of snippets.
6. The method, as claimed in claim 1, wherein said fetch engine server stores said entity into a database.
7. The method, as claimed in claim 1, wherein said fetch engine server records metadata values associated with each of said plurality of entities.
8. The method, as claimed in claim 1, wherein said fetch engine server creates links between a plurality of said entities.
9. A server for collating information from a plurality of websites on the world wide web, said server comprising of
A crawling and extraction engine for extracting a plurality of snippets of information from said plurality of websites;
A transformation engine for cleansing and transforming said plurality of snippets; and
A data integration engine for merging said transformed plurality of snippets into an entity.
10. The server, as claimed in claim 9, wherein said crawling and extraction engine identifies a plurality of websites where said plurality of websites may contain required entities and snippets of information; and crawls said identified plurality of websites to check if said identified plurality of websites contain said required snippets of information corresponding to the required entities and snippets of information.
11. The server, as claimed in claim 9, wherein crawling and extraction engine
reduces page layouts in said plurality of websites to semantic abstractions; and divides said plurality of websites into a plurality of logical segments.
12. The server, as claimed in claim 9, wherein said transformation engine
filters out snippets of information which do not meet pre-specified criteria; and standardizes values present in attributes of said snippets.
13. The server, as claimed in claim 9, wherein said server further comprising a de-duplication engine for
Assigning at least one fingerprint to each snippet;
Marking entities as potentially-similar if at least one fingerprint is common between said plurality of snippets.
14. The server, as claimed in claim 9, wherein said server stores said merged snippet into a database.
15. The server, as claimed in claim 9, wherein said server records metadata values associated with each of said plurality of entities.
16. The server, as claimed in claim 9, wherein said server creates links between a plurality of said entities.

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IN1103/CHE/2010 2010-04-20
IN1103CH2010 2010-04-20

Publications (2)

Publication Number Publication Date
WO2011132204A2 (en) 2011-10-27
WO2011132204A3 (en) 2011-12-22

Family

ID=44834574

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IN2011/000263 WO2011132204A2 (en) 2010-04-20 2011-04-20 Fetch engine

Country Status (1)

Country Link
WO (1) WO2011132204A2 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060026152A1 (en) * 2004-07-13 2006-02-02 Microsoft Corporation Query-based snippet clustering for search result grouping
US20060161542A1 (en) * 2005-01-18 2006-07-20 Microsoft Corporation Systems and methods that enable search engines to present relevant snippets
US20070208704A1 (en) * 2006-03-06 2007-09-06 Stephen Ives Packaged mobile search results
US20080114800A1 (en) * 2005-07-15 2008-05-15 Fetch Technologies, Inc. Method and system for automatically extracting data from web sites
US20080288675A1 (en) * 2007-05-18 2008-11-20 Seiko Epson Corporation Host device, information processor, electronic apparatus, program, and method for controlling reading

Also Published As

Publication number Publication date
WO2011132204A3 (en) 2011-12-22

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 11771689

Country of ref document: EP

Kind code of ref document: A2

NENP Non-entry into the national phase in:

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 11771689

Country of ref document: EP

Kind code of ref document: A2