US20160335286A1 - State Deduplication for Automated and Semi-Automated Crawling Architecture

Info

Publication number: US20160335286A1
Application number: US14/869,810
Authority: US
Inventors: Kalyan Desineni; Manikandan Sankaranarasimhan
Original assignee: Quixey Inc
Current assignee: Samsung Electronics Co Ltd
Priority date: 2015-05-13
Filing date: 2015-09-29
Publication date: 2016-11-17
Also published as: US10387379B2; US10152488B2; US10120876B2; US20160335348A1; US10146785B2; US20160335356A1; US20160335333A1; US20160335349A1

Abstract

A system for automated acquisition of content from an application includes state storage for storing state records. Each record includes a representation of content of a corresponding state of the application. A link tracking module controls the application to navigate to a first state of the application. A duplicate content detector calculates a representation of content of the first state and generates a comparison signal indicating whether the calculated representation matches any of the stored representations of content in the state records. The link tracking module creates a new state record in the state storage only in response to the comparison signal indicating that no match is found. The calculated representation is stored in the new state record. A scraper module, for each of the state records in the state storage, extracts text and metadata. Information based on the extracted text and metadata is stored in a data store.

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of U.S. Provisional Application No. 62/161,253, filed on May 13, 2015, and U.S. Provisional Application No. 62/193,051, filed on Jul. 15, 2015. The entire disclosures of the applications referenced above are incorporated herein by reference.

FIELD

The present disclosure relates to crawling applications for content, and more particularly to crawling mobile applications for content.

BACKGROUND

Search engines are an integral part of today's world. A key component of a search engine is the collection of search indices that power the search. In the context of a search engine, a search index can be an inverted index that associates keywords or combinations of keywords to documents (e.g., web pages) that contain the keyword or combination of keywords. In order to generate and maintain these search indexes, most search engines use crawlers to identify documents and information within the documents. A traditional crawler requests a document from a content provider and the content provider provides the requested document to the crawler. The crawler then identifies and indexes the keywords and combinations of keywords in the document.
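As a minimal sketch of the inverted index described above, the following Python fragment maps keywords to the documents that contain them. The function names, the whitespace tokenization, and the sample documents are assumptions made for illustration only; they are not taken from the disclosure.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each lowercase keyword to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for keyword in text.lower().split():
            index[keyword].add(doc_id)
    return index

def lookup(index, query):
    """Return the IDs of documents that contain every keyword in the query."""
    keywords = query.lower().split()
    if not keywords:
        return set()
    result = set(index.get(keywords[0], set()))
    for keyword in keywords[1:]:
        result &= index.get(keyword, set())
    return result

# Hypothetical documents used only to exercise the sketch.
docs = {
    "page1": "movie reviews for new releases",
    "page2": "restaurant reviews near you",
}
index = build_inverted_index(docs)
```

With these sample documents, a lookup for "reviews" returns both pages, while a lookup for "movie reviews" returns only the first.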
As the world transitions to a mobile-based architecture, the way content providers provide access to their content is changing. User devices can access content using a variety of different mechanisms. For example, user devices can obtain content from a content provider using a native application dedicated to accessing a software application of the content provider or using a web browser that accesses the software application. Furthermore, content providers may allow access to different content depending on the geographic region of a user device, the type of user device, the time of day, and/or the operating system of the user device. For these and other reasons, crawling has become an increasingly difficult task.

SUMMARY

A system for automated acquisition of content from an application includes a state storage module configured to store application state records. Each record of the application state records includes a representation of content of a corresponding application state of the application. The system includes a link tracking module configured to control an executing instance of the application to navigate to a first application state of the application. The system includes a duplicate content detector configured to (i) calculate a representation of content of the first application state, (ii) compare the calculated representation to the stored representations of content in the application state records, and (iii) generate a comparison signal based on the comparison. The comparison signal indicates whether a match is found between the calculated representation and any of the stored representations of content in the application state records. The link tracking module is configured to create a new application state record in the state storage module only in response to the comparison signal indicating that no match is found. The calculated representation is stored in the new application state record. The system includes a scraper module configured to, for each of the application state records in the state storage module, extract text and metadata from the corresponding application state. Information based on the extracted text and metadata is stored in a data store.
In other features, each record of the application state records includes a unique identifier that uniquely identifies the corresponding application state among the application state records of the state storage module. The duplicate content detector is configured to calculate the representation of content of the first application state only in response to a unique identifier of the first application state not matching any of the stored unique identifiers in the application state records of the state storage module.
In other features, the link tracking module is configured to, in response to the comparison signal indicating that the match is found between the calculated representation and a stored representation of a second record of the application state records, add the unique identifier of the first application state to the second record. The link tracking module is also configured to, in response to the comparison signal indicating that no match is found, store the unique identifier of the first application state in the new application state record.
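The record-creation logic summarized above can be sketched as follows. This is an illustration under stated assumptions, not the disclosed implementation: the class name StateStorage, the dictionary layout, the use of SHA-256 as the content representation, and the string form of the unique identifiers are all hypothetical.

```python
import hashlib

class StateStorage:
    """Holds state records keyed by a hash of state content (hypothetical layout)."""

    def __init__(self):
        self.records = {}    # content hash -> record dict
        self.known_ids = {}  # unique identifier -> content hash

    def observe(self, unique_id, content_text):
        """Record a visited state; return True only if a new record was created."""
        # Skip the content calculation when the identifier is already stored.
        if unique_id in self.known_ids:
            return False
        # SHA-256 is an assumed choice of content representation.
        content_hash = hashlib.sha256(content_text.encode("utf-8")).hexdigest()
        self.known_ids[unique_id] = content_hash
        record = self.records.get(content_hash)
        if record is not None:
            # Match found: add the identifier to the existing record.
            record["ids"].append(unique_id)
            return False
        # No match: create a new record with the representation and identifier.
        self.records[content_hash] = {"content_hash": content_hash, "ids": [unique_id]}
        return True
```

In this sketch, two different navigation paths that arrive at identical content share one record, and the second path is simply appended to that record's list of identifiers.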
In other features, for each record of the application state records, the unique identifier identifies a path followed within the executing application instance from a default state of the application to the corresponding application state. The unique identifier of the first application state indicates a path followed within the executing application instance from the default state of the application to the first application state. In other features, for each record of the application state records, the path includes user interface interactions performed within the executing application instance in order to navigate from the default state of the application to the corresponding application state.
In other features, each record of the application state records includes a user interface (UI) fingerprint of the corresponding application state. The duplicate content detector is configured to, in response to the calculated representation matching the stored representation of content of a second application state record, determine a UI fingerprint of the first application state, compare the UI fingerprint of the first application state to the stored UI fingerprint of the second application state record, and generate the comparison signal indicating that no match is found in response to the UI fingerprint of the first application state not matching the UI fingerprint of the second application state record.
In other features, for each record of the application state records, the UI fingerprint includes at least one of (i) a width of a UI widget tree extracted from the corresponding application state and (ii) a depth of the UI widget tree extracted from the corresponding application state. In other features, for each record of the application state records, the UI fingerprint includes a list of counts for predetermined UI widget types, each of the counts representing a number of occurrences of the predetermined UI widget type in the corresponding application state.
In other features, the duplicate content detector is configured to generate the comparison signal indicating that no match is found in response to a percentage difference between the UI fingerprint of the first application state and the UI fingerprint of the second application state record being greater than a predetermined threshold. In other features, the duplicate content detector is configured to calculate the representation of the content of the first application state by calculating a hash value of the content of the first application state.
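One way to picture the UI fingerprint comparison is sketched below. The disclosure does not define how the percentage difference is computed, so the metric here (total absolute difference relative to the larger of each pair of values) is an assumption, as are the widget type names, the dictionary-based widget tree, the definition of width as the largest number of widgets on one level, and the 10 percent default threshold.

```python
from collections import Counter

WIDGET_TYPES = ("Button", "TextView", "ImageView")  # assumed predetermined types

def depth(node):
    """Number of levels in the widget tree rooted at node."""
    return 1 + max((depth(c) for c in node.get("children", [])), default=0)

def width(node):
    """Largest number of widgets found on any single level of the tree."""
    widest, level = 0, [node]
    while level:
        widest = max(widest, len(level))
        level = [c for n in level for c in n.get("children", [])]
    return widest

def count_types(node, counts=None):
    """Count occurrences of each widget type in the tree."""
    counts = Counter() if counts is None else counts
    counts[node["type"]] += 1
    for child in node.get("children", []):
        count_types(child, counts)
    return counts

def ui_fingerprint(root):
    """Fingerprint: [width, depth, count per predetermined widget type...]."""
    counts = count_types(root)
    return [width(root), depth(root)] + [counts[t] for t in WIDGET_TYPES]

def percent_difference(fp_a, fp_b):
    """Assumed metric: total absolute difference relative to the larger values."""
    total = sum(max(a, b) for a, b in zip(fp_a, fp_b))
    if total == 0:
        return 0.0
    return 100.0 * sum(abs(a - b) for a, b in zip(fp_a, fp_b)) / total

def fingerprints_match(fp_a, fp_b, threshold_percent=10.0):
    """Report a match unless the difference exceeds the threshold."""
    return percent_difference(fp_a, fp_b) <= threshold_percent
```

Under these assumptions, two states whose text hashes agree but whose widget trees differ substantially would still be reported as distinct states.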
In other features, the content of the first application state includes text displayed within the first application state. In other features, the link tracking module is configured to execute the instance of the application within an emulator. In other features, the link tracking module is configured to identify a set of application states reachable from the first application state. Each of the set of application states is reachable via a respective user interface interaction with the first application state. The link tracking module is further configured to control the executing instance of the application to navigate to each of the set of application states in turn.
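The navigation to each reachable state in turn can be illustrated with a simple traversal. In the sketch below, a hypothetical dictionary (APP_GRAPH) stands in for the executing application instance, and a breadth-first visiting order is assumed; the disclosure requires neither. The path recorded for each state is the sequence of user interface interactions followed from the default state, which is one possible form of the path-based unique identifier.

```python
from collections import deque

def crawl(app_graph, default_state="home"):
    """Breadth-first visit of states; returns {state: path of UI interactions}."""
    paths = {default_state: ()}
    queue = deque([default_state])
    while queue:
        state = queue.popleft()
        # Each outgoing edge is one UI interaction leading to another state.
        for interaction, next_state in app_graph.get(state, {}).items():
            if next_state not in paths:
                paths[next_state] = paths[state] + (interaction,)
                queue.append(next_state)
    return paths

# Hypothetical app: state -> {interaction: resulting state}.
APP_GRAPH = {
    "home": {"tap:search": "search", "tap:top_movies": "list"},
    "search": {"tap:result_0": "movie"},
    "list": {"tap:item_0": "movie"},
}
```

Here the "movie" state is reachable by two routes, but only the first route discovered is kept as its path.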
In other features, the link tracking module is configured to identify a request made by the application to a remote server in order for the application to render the first application state. The duplicate content detector is configured to generate the comparison signal indicating a match between the first application state and a second application state record in response to parameters of the identified request matching stored request parameters in the second application state record.
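As a sketch of request-based matching, the fragment below normalizes a request URL so that two requests with the same parameters compare equal. Treating parameter order and host case as insignificant is an assumption, and the example URLs used to exercise it are hypothetical.

```python
from urllib.parse import urlsplit, parse_qsl

def request_key(url):
    """Normalize a request URL into (host, path, sorted query parameters)."""
    parts = urlsplit(url)
    params = tuple(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    return (parts.netloc.lower(), parts.path, params)

def requests_match(url_a, url_b):
    """True when both requests target the same host and path with equal parameters."""
    return request_key(url_a) == request_key(url_b)
```

A state record could store the result of request_key alongside its content representation so that a later request with matching parameters is recognized without rendering the state again.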
A search system includes the above system and the data store. The search system also includes a set generation module configured to, in response to a query from a user device, select records from the data store to form a consideration set of records. The search system also includes a set processing module configured to assign a score to each record of the consideration set of records. The search system also includes a results generation module configured to respond to the user device with a subset of the consideration set of records. The subset is selected based on the assigned scores. The subset identifies application states of applications that are relevant to the query.
A method for automated acquisition of content from an application includes storing application state records. Each record of the application state records includes a representation of content of a corresponding application state of the application. The method includes controlling an executing instance of the application to navigate to a first application state of the application. The method includes calculating a representation of content of the first application state. The method includes comparing the calculated representation to the stored representations of content in the application state records. The method includes generating a comparison signal based on the comparison. The comparison signal indicates whether a match is found between the calculated representation and any of the stored representations of content in the application state records. The method includes, only in response to the comparison signal indicating that no match is found, creating and storing a new application state record for the first application state. The calculated representation is stored in the new application state record. The method includes, for each of the application state records, extracting text and metadata from the corresponding application state. Information based on the extracted text and metadata is stored in a data store.
In other features, each record of the application state records includes a unique identifier that uniquely identifies the corresponding application state among the application state records. The calculating the representation of content of the first application state is performed only in response to a unique identifier of the first application state not matching any of the stored unique identifiers in the application state records. In other features, the method includes, in response to the comparison signal indicating that the match is found between the calculated representation and a stored representation of a second record of the application state records, adding the unique identifier of the first application state to the second record. The method includes, in response to the comparison signal indicating that no match is found, storing the unique identifier of the first application state in the new application state record.
In other features, for each record of the application state records, the unique identifier identifies a path followed within the executing application instance from a default state of the application to the corresponding application state. In addition, the unique identifier of the first application state indicates a path followed within the executing application instance from the default state of the application to the first application state. In other features, for each record of the application state records, the path includes user interface interactions performed within the executing application instance in order to navigate from the default state of the application to the corresponding application state.
In other features, each record of the application state records includes a user interface (UI) fingerprint of the corresponding application state. The method includes, in response to the calculated representation matching the stored representation of content of a second application state record: determining a UI fingerprint of the first application state; comparing the UI fingerprint of the first application state to the stored UI fingerprint of the second application state record; and generating the comparison signal indicating that no match is found in response to the UI fingerprint of the first application state not matching the UI fingerprint of the second application state record.
In other features, for each record of the application state records, the UI fingerprint includes at least one of (i) a width of a UI widget tree extracted from the corresponding application state and (ii) a depth of the UI widget tree extracted from the corresponding application state. In other features, for each record of the application state records, the UI fingerprint includes a list of counts for predetermined UI widget types, each of the counts representing a number of occurrences of the predetermined UI widget type in the corresponding application state.
In other features, the method includes generating the comparison signal indicating that no match is found in response to a percentage difference between the UI fingerprint of the first application state and the UI fingerprint of the second application state record being greater than a predetermined threshold. In other features, the method includes calculating the representation of the content of the first application state by calculating a hash value of the content of the first application state. In other features, the content of the first application state includes text displayed within the first application state.
In other features, the method includes identifying a set of application states reachable from the first application state. Each of the set of application states is reachable via a respective user interface interaction with the first application state. The method includes controlling the executing instance of the application to navigate to each of the set of application states in turn. In other features, the method includes identifying a request made by the application to a remote server in order for the application to render the first application state. The method includes generating the comparison signal indicating a match between the first application state and a second application state record in response to parameters of the identified request matching stored request parameters in the second application state record.
In other features, a non-transitory computer-readable medium stores processor-executable instructions configured to perform any of the above methods.

BRIEF DESCRIPTION OF THE DRAWINGS

The present disclosure will become more fully understood from the detailed description and the accompanying drawings.

FIG. 1 is a combined functional block diagram and graphical user interface example according to the principles of the present disclosure.

FIG. 2 is a functional block diagram of an example implementation of the search system of FIG. 1.

FIG. 3A is a graphical representation of an example application state record format.

FIG. 3B is a graphical representation of an example application state record according to the format of FIG. 3A.

FIG. 4 is a functional block diagram of an unguided app crawling system.

FIG. 5 is an illustration of an example state table generated by unguided app crawling according to the principles of the present disclosure.

FIG. 6 is a flowchart of example operation of unguided app crawling.

FIG. 7 is a graphical representation of an example widget tree of a fictitious application state.

In the drawings, reference numbers may be reused to identify similar and/or identical elements.

DETAILED DESCRIPTION

Example User Interface

In FIG. 1, an unsophisticated Search App 100 is shown running on a user device, such as smartphone 104. A first state of the Search App 100 is shown at 100-1, and the corresponding reference numeral for the smartphone 104 is 104-1. In the simple interface of the Search App 100-1, a search bar 112 allows a user to perform a search—in this case, for reviews of a (thankfully) fictional movie “The Hobbit XIII.”
When a user of the Search App 100 selects (such as by tapping their finger on) a button 120 (having a magnifying glass icon), a query wrapper 124 is sent to a search system 132. Example contents of the query wrapper 124 may include a text query, such as “The Hobbit XIII Movie Reviews.” Note that the text in the query wrapper 124 includes not just the function (movie reviews) but also an indication of an entity (“The Hobbit XIII”) that is the target of the function. This indicates the user's intent that not only should a movie review app be shown, but preferably a state of the movie review app that directly provides reviews of The Hobbit XIII.
The search system 132, as described in more detail below, identifies relevant apps and app states based on the query wrapper 124. For example, relevant results will generally include apps that have movie review functionality and that include a state with movie reviews of the identified entity (“The Hobbit XIII”). The search system 132 returns app state results 140 to the smartphone 104, and example contents of the app state results 140 are described in more detail below.
A graphical presentation of the app state results 140 is displayed in a state 100-2 of the Search App 100, and the corresponding reference numeral for the smartphone 104 showing the state 100-2 is 104-2. The graphical results may be displayed in a portion of the Search App 100. In various implementations, the results may slide in from one side or from the top or bottom, suggesting to the user that the results can be dismissed by swiping in an opposite direction. The search string that yielded the results may be presented to the user, such as in a textbox 144. The textbox 144 may allow the user to revise the search string in order to perform additional searches.
Each graphical result of the app state results 140 may include a header (or, title), such as the header “Fandango Movies” at 148. The header may be the same as a title of an app, and may also indicate whether the app is installed. As shown in FIG. 1 with an “installed” parenthetical, “Fandango Movies” is already installed on the smartphone 104. Other text, such as “Open,” may similarly indicate that the app is already installed. Apps not yet installed may be indicated with “Download” or “Install” labels. Equivalently, icons or other visual cues may distinguish between apps that can simply be opened (including native apps and, as described in more detail below, web apps) and apps that first need to be installed.
Two specific states are displayed with respect to the “Fandango Movies” app: “The Hobbit XIII” at 152 and “The Hobbit XIII (Extended)” at 156. This text may be the title text of the corresponding state in the “Fandango Movies” app. Additional data associated with each of these states may be shown. For example, the search system 132 may indicate that “The Hobbit XIII” state of the “Fandango Movies” app includes a 3-star rating. This 3-star rating may be shown at 160. Other data may include snippets of text (such as the first few words of a review), an image (such as a screenshot of the state), a reliability metric (such as number of user reviews), a freshness metric (such as most recent observed update to the state), etc.
These specific states may include user-selectable links directly to the corresponding entries in the “Fandango Movies” app. In other words, in response to user selection (such as by tapping the area of the screen associated with “The Hobbit XIII” 152), the Search App 100 will open the “Fandango Movies” app to the state where movie reviews are shown for “The Hobbit XIII.” As described in more detail below, this direct action may be accomplished by passing an identifier of the “The Hobbit XIII” state as a parameter to the “Fandango Movies” app or by executing a script that navigates to the state for “The Hobbit XIII” from another state of the “Fandango Movies” app.
If the user selects an area of the graphical results in the Search App 100 that is associated with the “Fandango Movies” app, but not with one of the specific states 152 or 156, the Search App 100 may open the “Fandango Movies” app to a default state. In other implementations, selecting an area not associated with one of the specific states 152 or 156 will result in no action.
A deep view card for an application or a state of an application shows additional information, not just the identification of the application or application state. For example, the information may include a title of the application state or a description of the application state, which may be a snippet of text from the application state. Other metadata may be provided from the application state, including images, location, number of reviews, average review, and status indicators. For example, a status indicator of “open now” or “closed” may be applied to a business depending on whether the current time is within the operating hours of the business.
Some deep view cards may emphasize information that led to the deep view card being selected as a search result. For example, text within the deep view card that matches a user's query may be shown in bold or italics. The deep view card may also incorporate elements that allow direct actions, such as the ability to immediately call an establishment or to transition directly to a mapping application to get navigation directions to the establishment. Other interactions with the deep view card (such as tapping or clicking any other area of the deep view card) may take the user to the indicated state or application. As described in more detail below, this may be accomplished by opening the relevant app or, if the app is not installed, opening a website related to the desired application state. In other implementations, an app that is not installed may be downloaded, installed, and then executed in order to reach the desired application state.
In other words, a deep view card includes an indication of the application or state as well as additional content from the application or state itself. The additional content allows the user to make a more informed choice about which result to choose, and may even allow the user to directly perform an action without having to navigate to the application state. If the action the user wants to take is to obtain information, in some circumstances the deep view card itself may provide the necessary information.
A deep view is presented for “IMDb Movies & TV” at 164. A user-selectable link 168 is shown for a state of the “IMDb Movies & TV” app titled “The Hobbit XIII: Smaug Enters REM Sleep.” The “IMDb Movies & TV” app is not shown with an “installed” parenthetical, indicating that download and installation must first be performed.
Selecting the user-selectable link 168 may therefore trigger the opening of a digital distribution platform in either a web browser or a dedicated app, such as the app for the GOOGLE PLAY STORE digital distribution platform. The identity of the app to be downloaded (in this case, the IMDb app) is provided to the digital distribution platform so that the user is immediately presented with the ability to download the desired app. In some implementations, the download may begin immediately, and the user may be given the choice of approving installation. Upon completion of installation, control may automatically navigate to the desired state of the “IMDb Movies & TV” app—that is, the state for “The Hobbit XIII: Smaug Enters REM Sleep”.
A “Movies by Flixster” app title is shown at 176, and is associated with a user-selectable link 180 for a state titled “The Hobbit XIII” and a user-selectable link 182 for a state titled “The Hobbit XII.” The user-selectable link 180 includes additional data associated with the state for “The Hobbit XIII.” Specifically, graphical and numerical representations of critics' reviews of the movies “The Hobbit XIII” and “The Hobbit XII” are depicted at 184.

Search Module

In FIG. 2, an example implementation of the search system 132 includes a search module 200. The search module 200 includes a query analysis module 204 that receives a query wrapper, such as the query wrapper 124 of FIG. 1. The query analysis module 204 analyzes the text query from the query wrapper. For example, the query analysis module 204 may tokenize the query text, filter the query text, and perform word stemming, synonymization, and stop word removal. The query analysis module 204 may also analyze additional data stored within the query wrapper. The query analysis module 204 provides the tokenized query to a set generation module 208.
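The query analysis steps named above (tokenizing, stop word removal, synonymization, and stemming) can be sketched as a small pipeline. The stop word list, the synonym map, and the toy suffix-stripping stemmer are placeholders assumed for illustration; a production system would likely use a full stemmer and curated lists.

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "for"}    # assumed list
SYNONYMS = {"film": "movie", "films": "movie"}  # assumed mapping

def stem(token):
    """Toy suffix stripper standing in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def analyze_query(text):
    """Tokenize, drop stop words, map synonyms, then stem each token."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    tokens = [SYNONYMS.get(t, t) for t in tokens]
    return [stem(t) for t in tokens]
```

Applied to the example query “The Hobbit XIII Movie Reviews,” this sketch yields the tokens hobbit, xiii, movie, and review.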
The set generation module 208 identifies a consideration set of application state records from a search data store 210 based on the query tokens. Application (equivalently, app) state records are described in more detail with reference to FIG. 3A and FIG. 3B. In various implementations, the search data store 210 may also include app records. In various implementations, an app record may be stored as an app state record that simply has a predetermined value, such as null, for the specific state of the app.
App state records in the search data store 210 may be generated by crawling and scraping apps according to the principles of the present disclosure. Some or all of the contents of the records of the search data store 210 may be indexed in inverted indices. In some implementations, the set generation module 208 uses the APACHE LUCENE software library by the Apache Software Foundation to identify records from the inverted indices. The set generation module 208 may search the inverted indices to identify records containing one or more query tokens. As the set generation module 208 identifies matching records, the set generation module 208 can include the unique ID of each identified record in the consideration set. For example, the set generation module 208 may compare query terms to an app state name and app attributes (such as a text description and user reviews) of an app state record.
Further, in some implementations, the set generation module 208 may determine an initial score of the record with respect to the search query. The initial score may indicate how well the contents of the record matched the query. For example, the initial score may be a function of term frequency-inverse document frequency (TF-IDF) values of the respective query terms.
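As one possible reading of an initial score based on TF-IDF, the sketch below sums the TF-IDF values of the query terms for a single record. The summation, the smoothed form of the inverse document frequency, and the representation of a record as a list of tokens are all assumptions; the disclosure states only that the score may be a function of TF-IDF values.

```python
import math

def tf_idf_score(query_tokens, record_tokens, all_records):
    """Sum of TF-IDF values of the query terms for one record."""
    score = 0.0
    for term in query_tokens:
        # Term frequency: share of the record's tokens equal to the term.
        tf = record_tokens.count(term) / len(record_tokens) if record_tokens else 0.0
        # Smoothed inverse document frequency (an assumed variant).
        containing = sum(1 for tokens in all_records if term in tokens)
        idf = math.log((1 + len(all_records)) / (1 + containing)) + 1.0
        score += tf * idf
    return score
```

A record that mentions a query term more often, relative to its length, receives a higher initial score than one that mentions it rarely or not at all.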
A set processing module 212 receives unique IDs of app state records identified by the set generation module 208 and determines a result score for some or all of the IDs. A result score indicates the relevance of an app state with respect to the tokenized query and context parameters. In various implementations, a higher score indicates a greater perceived relevance.
For example, other items in the query wrapper may act as context parameters. Geolocation data may limit the score of (or simply remove altogether) apps that are not pertinent to the location of the user device. A blacklist in the query wrapper may cause the set processing module 212 to remove app records and/or app state records from the consideration set that match the criteria in the blacklist, or to set their score to a null value, such as zero.
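The effect of context parameters on the consideration set can be sketched as a filtering pass. The record fields (app_id, score, regions), the cap applied to out-of-region apps, and the function name are hypothetical; the sketch removes blacklisted apps and limits, rather than removes, apps that do not list the user's region.

```python
def apply_context(scored_records, blacklist=frozenset(), user_region=None,
                  out_of_region_cap=0.1):
    """Drop blacklisted apps and cap the score of apps outside the user's region."""
    result = []
    for record in scored_records:
        if record["app_id"] in blacklist:
            continue  # removed from the consideration set
        score = record["score"]
        regions = record.get("regions")
        if user_region is not None and regions is not None and user_region not in regions:
            score = min(score, out_of_region_cap)  # assumed way to limit the score
        # Copy the record so the caller's data is not modified.
        result.append({**record, "score": score})
    return result
```

Records without region information are left unchanged in this sketch, on the assumption that such apps are not location-specific.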
The set processing module 212 may generate a result score based on one or more scoring features, such as record scoring features, query scoring features, and record-query scoring features. Example record scoring features may be based on measurements associated with the record, such as how often the record is retrieved during searches and how often links generated based on the record are selected by a user. Query scoring features may include, but are not limited to, the number of words in the search query, the popularity of the search query, and the expected frequency of the words in the search query. Record-query scoring features may include parameters that indicate how well the terms of the search query match the terms of the record indicated by the corresponding ID.
The set processing module 212 may include one or more machine-learned models (such as a supervised learning model) configured to receive one or more scoring features. The one or more machine-learned models may generate result scores based on at least one of the record scoring features, the query scoring features, and the record-query scoring features.
For example, the set processing module 212 may pair the search query with each app state ID and calculate a vector of features for each {query, ID} pair. The vector of features may include one or more record scoring features, one or more query scoring features, and one or more record-query scoring features. In some implementations, the set processing module 212 normalizes the scoring features in the feature vector. The set processing module 212 can set non-pertinent features to a null value or zero.
The set processing module 212 may then input the feature vector for one of the app state IDs into a machine-learned regression model to calculate a result score for the ID. In some examples, the machine-learned regression model may include a set of decision trees (such as gradient-boosted decision trees). Additionally or alternatively, the machine-learned regression model may include a logistic probability formula. In some implementations, the machine-learned task can be framed as a semi-supervised learning task, where a minority of the training data is labeled with human-curated scores and the rest are used without human labels.
The machine-learned model outputs a result score of the ID. The set processing module 212 can calculate result scores for each of the IDs that the set processing module 212 receives. The set processing module 212 associates the result scores with the respective IDs and outputs the most relevant scored IDs.
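The scoring flow described above, from feature vector to result score, can be illustrated with the logistic alternative mentioned in the text. The min-max normalization, the feature ordering, and the weights are assumptions made for this sketch; in practice the weights would be learned from training data, and a set of decision trees could be used instead.

```python
import math

def normalize(features):
    """Min-max scale each feature column to [0, 1] across the consideration set."""
    columns = list(zip(*features))
    scaled_columns = []
    for column in columns:
        low, high = min(column), max(column)
        span = high - low
        # A constant column carries no information, so it is set to zero.
        scaled_columns.append([(v - low) / span if span else 0.0 for v in column])
    return [list(row) for row in zip(*scaled_columns)]

def logistic_score(vector, weights, bias=0.0):
    """Result score in (0, 1) from a weighted sum of features."""
    z = bias + sum(w * x for w, x in zip(weights, vector))
    return 1.0 / (1.0 + math.exp(-z))
```

Each row passed to normalize would be the feature vector for one {query, ID} pair, and logistic_score would then be applied to each normalized row to produce that ID's result score.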
A results generation module 224 may choose specific access mechanisms from the application records and app state records chosen by the set processing module 212. The results generation module 224 then prepares a results set to return to the user device. Although called “app state results” here, some of the access mechanisms may correspond to a default state (such as a home page) of an app—these may be a special case of an app state record or may be an app record.
The results generation module 224 may select an access mechanism for an app state record based on whether the app is installed on the device. If the app is installed, an access mechanism that opens the app directly to the specified state is selected. Meanwhile, if the app is not installed, a selected access mechanism first downloads and installs the app, such as via a script, before opening the app to the specified state. Opening the app to the specified state may include a single command or data structure (such as an intent in the ANDROID operating system) that directly actuates the specified state. For other apps, a script or other sequence may be used to open the app to a certain state (such as a home, or default, state) and then navigate to the specified state.
The results generation module 224 may generate or modify access mechanisms based on the operating system identity and version for the user device to which the results are being transmitted. For example, a script to download, install, open, and navigate to a designated state may be fully formed for a specific operating system by the results generation module 224.
If the results generation module 224 determines that none of the native access mechanisms are likely to be compatible with the user device, the search module 200 may send a web access mechanism to the user device. If no web access mechanism is available, or would be incompatible with the user device for some reason (for example, if the web access mechanism relies on the JAVA programming language, which is not installed on the user device), the results generation module 224 may omit the result.
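For example only, the selection logic described above may be sketched as follows. The dictionary keys `platform` and `kind`, and the values `open`, `install_then_open`, and `web`, are hypothetical labels chosen for illustration and are not defined by the present disclosure.

```python
def select_access_mechanism(mechanisms, device_os, installed):
    """Pick one access mechanism for a user device, or None to omit the result.

    mechanisms: list of dicts with hypothetical keys 'platform' and 'kind'.
    device_os:  operating system of the user device, such as 'android'.
    installed:  True if the app is already installed on the device.
    """
    wanted = "open" if installed else "install_then_open"
    for mechanism in mechanisms:
        # Prefer a native mechanism that matches the device and install status.
        if mechanism["platform"] == device_os and mechanism["kind"] == wanted:
            return mechanism
    for mechanism in mechanisms:
        # Fall back to a web access mechanism when no native one fits.
        if mechanism["platform"] == "web":
            return mechanism
    return None  # no compatible mechanism: the result is omitted
```

Operating system version checks and script generation are left out of this sketch.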

App State Records

In FIG. 3A, an example format of an app state record 250 includes an app state identifier (ID) 250-1, app state information 250-2, an app identifier (ID) 250-3, and one or more access mechanisms 250-4. The app state ID 250-1 may be used to uniquely identify the app state record 250 in the search data store 210. The app state ID 250-1 may be a string of alphabetic, numeric, and/or special (e.g., punctuation marks) characters that uniquely identifies the associated app state record 250. In some examples, the app state ID 250-1 describes the application state in a human-readable form. For example, the app state ID 250-1 may include the name of the application referenced in the access mechanisms 250-4.
In a specific example, an app state ID 250-1 for an Internet music player application may include the name of the Internet music player application along with the song name that will be played when the Internet music player application is set into the specified state. In some examples, the app state ID 250-1 is a string formatted similarly to a uniform resource locator (URL), which may include an identifier for the application and an identifier of the state within the application. In other implementations, a URL used as the app state ID 250-1 may include an identifier for the application, an identifier of an action to be provided by the application, and an identifier of an entity that is the target of the action.
For example only, see FIG. 3B, which shows an example app state record 254 associated with the OPENTABLE application from OpenTable, Inc. The OPENTABLE application is a restaurant-reservation application that allows users to search for restaurants, read reviews, and make restaurant reservations. The example app state record 254 of FIG. 3B describes an application state of the OPENTABLE application in which the OPENTABLE application accesses information for THE FRENCH LAUNDRY restaurant, a Yountville, Calif. restaurant. An app state ID 254-1 for the example app state record 254 is shown as “OpenTable—The French Laundry.”
Another implementation of the displayed app state ID 254-1 is based on a triplet of information: {application, action, entity}. The triplet for the app state record 254 may be {“OpenTable”, “Show Reviews”, “The French Laundry”}. As mentioned above, this triplet may be formatted as a URL, such as the following: “func://www.OpenTable.com/Show_Reviews/The_French_Laundry”. Note that a different namespace is used (“func://”) to differentiate from the standard web namespace (“http://”), as the URL-formatted ID may not resolve to an actual web page. For example only, the OpenTable website may use a numeric identifier for each restaurant in their web URLs instead of the human-readable “The_French_Laundry.”
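For example only, formatting the {application, action, entity} triplet as a URL in the "func://" namespace may be sketched as follows. The function name and the `host_suffix` parameter are assumptions made for illustration; the only format taken from the text is the example URL itself.

```python
def functional_url(application, action, entity, host_suffix=".com"):
    """Format an {application, action, entity} triplet as a func:// URL."""

    def slug(text):
        # Replace runs of whitespace with underscores, as in "Show_Reviews".
        return "_".join(text.split())

    return f"func://www.{application}{host_suffix}/{slug(action)}/{slug(entity)}"


print(functional_url("OpenTable", "Show Reviews", "The French Laundry"))
# prints func://www.OpenTable.com/Show_Reviews/The_French_Laundry
```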
Continuing with FIG. 3A, the app state information 250-2 may include data that describes an app state into which an application is set according to the access mechanisms 250-4. The types of data included in the app state information 250-2 may depend on the type of information associated with the app state and the functionality specified by the access mechanisms 250-4. The app state information 250-2 may include a variety of different types of data, such as structured, semi-structured, and/or unstructured data. The app state information 250-2 may be automatically and/or manually generated and updated based on documents retrieved from various data sources, which may include crawling of the apps themselves.
In some examples, the app state information 250-2 includes data presented to a user by an application when in the app state corresponding to the app state record 250. For example, if the app state record 250 is associated with a shopping application, the app state information 250-2 may include data that describes products (such as names and prices) that are shown in the app state corresponding to the app state record 250. As another example, if the app state record 250 is associated with a music player application, the app state information 250-2 may include data that describes a song (such as by track name and artist) that is played or displayed when the music player application is set to the specified app state.
When the app state record 250 corresponds to a default state of an application, the app state information 250-2 may include information generally relevant to the application and not to any particular app state. For example, the app state information 250-2 may include the name of the developer of the application, the publisher of the application, a category (e.g., genre) of the application, a text description of the application (which may be specified by the application's developer), and the price of the application. The app state information 250-2 may also include security or privacy data about the application, battery usage of the application, and bandwidth usage of the application. The app state information 250-2 may also include application statistics, such as number of downloads, download rate (for example, average downloads per month), download velocity (for example, number of downloads within the past month as a percentage of total downloads), number of ratings, and number of reviews.
In FIG. 3B, the example app state record 254 includes app state information 254-2, including a restaurant category field 254-2 a of THE FRENCH LAUNDRY restaurant, a name and text description field 254-2 b of THE FRENCH LAUNDRY restaurant, user reviews field 254-2 c of THE FRENCH LAUNDRY restaurant, and additional data fields 254-2 d.
The restaurant category field 254-2 a may include multiple categories under which the restaurant is categorized, such as the text labels “French cuisine” and “contemporary.” The name and description field 254-2 b may include the name of the restaurant (“The French Laundry”) and text that describes the restaurant. The user reviews field 254-2 c may include text of user reviews for the restaurant. The additional data fields 254-2 d may include additional data for the restaurant that does not specifically fit within the other defined fields, such as a menu, prices, and operating hours.
Continuing with FIG. 3A, the app ID 250-3 uniquely identifies an application associated with the app state record 250. For example, a value for application ID 254-3 in the app state record 254 uniquely identifies the OpenTable application. The application ID 254-3 may refer to a canonical OpenTable software product that encompasses all of the editions of the OpenTable application, including all the native versions of the OpenTable application across platforms (for example, IOS and ANDROID operating systems) and any web editions of the OpenTable application.
The access mechanisms 250-4 specify one or more ways that the state specified by the app state record 250 can be accessed. For any given user device, only some of the access mechanisms 250-4 may be relevant. For illustration, the example app state record 254 depicts three access mechanisms 254-4, including access mechanism “a” 254-4 a, access mechanism “b” 254-4 b, and access mechanism “c” 254-4 c.
For example, the access mechanism 250-4 a may include a reference to a native IOS operating system edition of the OPENTABLE application along with one or more operations to be performed by the user device. For example, the access mechanism 250-4 a may include an application resource identifier for the native iOS edition of the OPENTABLE application and one or more operations that navigate to the state in the OPENTABLE application for THE FRENCH LAUNDRY restaurant.
The access mechanism 250-4 b may include a reference to a native ANDROID operating system edition of the OPENTABLE application along with one or more operations to be performed by the user device to navigate to the state in the ANDROID OPENTABLE application for THE FRENCH LAUNDRY. The access mechanism 250-4 c may include a reference to a web edition of the OPENTABLE application, such as a URL that corresponds to a web page for THE FRENCH LAUNDRY restaurant on the OPENTABLE web site.

Unguided Crawling

In the web domain, crawlers generally operate by following links between pages. Because of the way pages link to each other, a crawler will sometimes see a page it has already crawled. The web crawler should detect that the same page is being crawled a second time and avoid crawling it. The web crawler may perform this duplicate detection by comparing the URL (uniform resource locator) of a target page to the URLs of already-crawled pages to verify that the target page has not yet been crawled. However, because app states do not have URLs, this problem is harder to solve when crawling apps.
One solution is to uniquely identify a state with a URI (uniform resource identifier), which is a more generalized form of a URL such as is used on the web. A URI may take the form of a functional URL (which specifies a particular starting state, and may correspond one-to-one with a predefined intent) combined with (optional) steps used to get to the state. For example, a particular state may be accessed by navigating to a first state F1 and then selecting button B1 from the user interface. A URI for that state can be labeled as “<F1,B1>”. The same state may also be accessed by navigating to state F2 and selecting button B2. Thus the same state may have multiple URIs: “<F1,B1>” and “<F2,B2>”.
When crawling, the crawler would ideally be able to detect whether a state is the same as a previous state already crawled. This is difficult because the crawler may have followed a different path to the state than it previously has, so that the URI for the state is different from the previously-crawled URI. To solve this problem, the crawler may use a deduplication module that detects whether the content in the current state is the same as the content in a previously-crawled state.
Content matching may use one or more different methods. Because the content of states may vary when viewed at different times (e.g., different ads may appear, or real-time content may change), a simple hash may not work by itself. Possible solutions include using Bloom filters and fuzzy content matching.
Since apps use UI (user interface) widgets to display content, UI layout may assist in content matching. Each widget may have a layout defining further UI elements inside it. The whole content can be captured in this hierarchical form along with widget info and content, and dumped as, for example, a JSON (JavaScript Object Notation) file. Dumping content may be performed by the app scraper or by a module that uses code forked from the app scraper. The deduplication module will compare this JSON output to previously-obtained JSON output to determine whether there is a match. Naively, two states are similar if they have identical hierarchy and content. However, since even the same state may differ (for example, based on ads), a threshold of matched contents, matched hierarchies, or a combination may be defined.
This threshold may be expressed as a percentage and may be adjusted dynamically, for example having different values between different apps, or between different app state templates. When comparing both UI hierarchy and content, a weighted average may be used to combine a percentage similarity from both into a single overall percentage similarity. The single overall percentage similarity is then compared with a threshold to determine whether the two states should be considered to be the same.
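For example only, combining a hierarchy similarity and a content similarity into a single overall similarity may be sketched as follows. The sketch assumes that each state has been dumped to JSON and parsed into nested dictionaries with hypothetical keys `class`, `text`, and `children`. The use of Jaccard similarity for each component is an assumption, since the disclosure does not name a particular similarity measure.

```python
def flatten(node, path=()):
    """Yield (path of widget classes, text) for every node of a dumped widget tree."""
    here = path + (node["class"],)
    yield here, node.get("text", "")
    for child in node.get("children", []):
        yield from flatten(child, here)


def jaccard(items_a, items_b):
    """Share of distinct items that two collections have in common."""
    a, b = set(items_a), set(items_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def state_similarity(dump_a, dump_b, hierarchy_weight=0.5):
    """Weighted average of hierarchy similarity and content similarity."""
    nodes_a, nodes_b = list(flatten(dump_a)), list(flatten(dump_b))
    hierarchy = jaccard((p for p, _ in nodes_a), (p for p, _ in nodes_b))
    content = jaccard((t for _, t in nodes_a if t), (t for _, t in nodes_b if t))
    return hierarchy_weight * hierarchy + (1.0 - hierarchy_weight) * content


def same_state(dump_a, dump_b, threshold=0.8):
    """Declare a match when the overall similarity reaches the threshold."""
    return state_similarity(dump_a, dump_b) >= threshold
```

The `threshold` and `hierarchy_weight` values are placeholders that could be tuned per app or per app state template.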
To enable deduplication, the content of each state may be stored and associated with the URIs for that state. As described in more detail below, in some cases the system may observe that an intent fires when a state is loaded. As part of dynamic analysis, APIs (application programming interfaces) used in intent firing may be monitored. As part of this monitoring, which UI action triggered an intent can be detected and therefore associated with the state.
In such a case, the state may be accessed directly using the intent without taking any additional steps. For example, the crawler accesses a state using a URI <F1,B1> and then observes that intent F2 fires. Both <F1,B1> and <F2> may be added to the URI list for that state.
In FIG. 4, an unguided crawler module 300 includes an emulator 304, a link tracking module 308, a duplicate content detector 312, and content storage 316. The content storage 316 may include a relational database or an array of data structures, each data structure corresponding to one state and including a representation of content and one or more identifiers.
The unguided crawler module 300 begins with no insight into the application of interest. As explained in more detail in FIG. 6, the unguided crawler module 300 opens up the app (which may be provided as an APK file in some environments) to a home state in the emulator 304. From the home state, the unguided crawler module 300 follows every possible path.
Specifically, the link tracking module 308 determines all of the possible states reachable from the home state via possible user interface interactions and adds those states to a list of states to crawl. For ease of illustration, a single emulator 304 is shown in FIG. 4. However, the link tracking module 308 may operate multiple emulators in parallel, and may assign different states for each respective emulator to crawl.
The emulator 304 emulates a host operating system, such as the ANDROID operating system, in which the application of interest is executed (at 320). In various implementations, the emulator 304 may be instantiated at a cloud hosting operator that may provide compute facilities within which to execute emulator code or that may directly provide emulator instances for one or more mobile device operating systems.
In other implementations, a physical device running the operating system may be used. For example, some operating systems may not have suitable emulators. The physical device may be connected to the link tracking module 308 using a wireless or wired interface, such as USB (universal serial bus). As an example only, a physical smartphone may be connected via USB to an interface card that is controlled by the link tracking module 308.
In order to reach a state within the executing application 320, the link tracking module 308 sends an access path (sometimes referred to as a breadcrumb trail, and described in more detail below) to a link extractor actor 322. When the link extractor actor 322 identifies that an intent is available to reach an activity, the link extractor actor 322 passes this information back to the link tracking module 308 as another path to the state.
In various implementations, the link extractor actor 322 executes within the emulator 304 and communicates with the executing application 320 using accessibility hooks or events. The link extractor actor 322 may identify user interface actions from the current state that will result in another state being reached using a scraping module 324. These new states are provided to the link tracking module 308.
In other words, the link extractor actor 322 can provide simulated user input to the executing application 320 and the scraping module 324 extracts content from each displayed state of the executing application 320. The link tracking module 308 can then identify further states to visit from the scraped information and instruct the link extractor actor 322 to follow a user interface path corresponding to the next state of interest.
In implementations where the emulator 304 is instead a physical device, the link extractor actor 322 and the scraping module 324 may be installed as root-level applications on the physical device. Installing a root-level application may include designating the application as a launcher replacement and/or bypassing security limitations of the firmware or operating system regarding privileges of installed apps.
The crawl of an app may optionally begin with a user login. An operator of the unguided crawler module 300 may record steps (user interface events) to perform the login. Alternatively, stored private settings (or authentication tokens) may be present in some ANDROID apps, and may be able to be copied from a device the operator had previously been using. Then the crawl can occur without operator intervention.
Each state that is reached is labeled with a uniform resource identifier (URI) according to the series of user interface interactions that were performed to reach the state. States can be compared by URI: if the URI of a state matches a previous URI, the present state is assumed to be the same as the previous state. A state may be reached through a variety of paths and may therefore be labeled with multiple URIs.
From a given state, the unguided crawler module 300 may generate URIs for new states to crawl by taking the URI of the current state and appending an action corresponding to each UI element on the page. For example, a button B2 available on a state with URI <F1,B1> may be used to generate a URI <F1,B1,B2>. When accessing a state using a URI (during a crawl or for any other reason), the most efficient way of accessing the state should generally be used. This may be a direct URL (access URL) or intent if available; otherwise it will be the URI that has the least number of additional user actions required to access the state.
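For example only, generating candidate URIs from the current state may be sketched as follows. Representing a URI as a tuple of steps, and the function names, are assumptions made for illustration.

```python
def expand_uris(current_uri, actions):
    """Append one action per UI element to the URI of the current state."""
    return [current_uri + (action,) for action in actions]


def format_uri(uri):
    """Render a URI tuple in the "<F1,B1>" notation used above."""
    return "<" + ",".join(uri) + ">"


candidates = expand_uris(("F1", "B1"), ["B2", "B3"])
print([format_uri(uri) for uri in candidates])
# prints ['<F1,B1,B2>', '<F1,B1,B3>']
```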

Deduplication

To identify whether the present state is the same as a prior state with a different URI, the content of each state is stored in the content storage 316. The duplicate content detector 312 compares the content of the present state to the content of prior states. When the content matches, the states are assumed to be the same, and that state is then listed with both URIs. In various implementations, the content storage 316 may store a signature or fingerprint of content within each state. For example, a hash or other compression function may be calculated based on the content in the state.
As described in more detail below, the user interface layout of the state may also be fingerprinted—that is, described in a reduced representation. For example only, the content storage 316 may store a signature of the content of a state as well as a signature of the user interface layout of the state.
Especially because crawling apps for content is generally slower than comparable web crawling, identifying duplicates ahead of time reduces the burden for both crawling and scraping. Techniques to identify duplicates in app content include content metadata tagging, UI (user interface) pattern matching, and API (application programming interface) comparison.
The API calls made by a state may be a unique identifier of the state. For example, a state that sends the same data in an API call to a server may be considered to be the same as another state that sends that same data, even if the response from the server is dynamic and changes over time. For example, in a restaurant info state, the featured reviews may change over time, causing the text content of the state to change over time. In such circumstances, the API request can be used to identify that a selected state is actually the same as a previously-stored state.
Using content metadata tagging, a set of key fields is identified to create metadata for the content. The attributes are identified in a way that they can uniquely identify the content in the activity. The metadata is hashed and stored in the content storage 316. The duplicate content detector 312 checks the hash of any new state against existing hashes.
The hash algorithm may be chosen based on the desire to avoid collisions, but with the flexibility of not needing to be cryptographically secure. In other words, in most implementations, there is no requirement that the hash function be one-way (that is, impossible with reasonable computing resources to determine the input data that resulted in the output).
The output length of the hash may be based on how many unique states are expected in an application, with more states leading to a longer hash to reduce the probability of collision to an acceptable level. The hash may be calculated using, as examples only, a cyclic redundancy check, a checksum, Rabin fingerprinting, the Fowler-Noll-Vo algorithm, Pearson hashing, a Jenkins hash function, or a hashCode( ) method of the JAVA programming language.
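For example only, content metadata tagging with a non-cryptographic hash may be sketched as follows, using the 64-bit FNV-1a variant of the Fowler-Noll-Vo algorithm. The key field names and the way the fields are joined before hashing are assumptions made for illustration.

```python
FNV64_OFFSET_BASIS = 0xCBF29CE484222325
FNV64_PRIME = 0x100000001B3


def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a hash: XOR each byte in, then multiply by the FNV prime."""
    value = FNV64_OFFSET_BASIS
    for byte in data:
        value ^= byte
        value = (value * FNV64_PRIME) & 0xFFFFFFFFFFFFFFFF
    return value


def metadata_hash(state, key_fields):
    """Hash only the key fields of a state so that volatile fields are ignored."""
    parts = [f"{field}={state.get(field, '')}" for field in sorted(key_fields)]
    # The unit separator keeps adjacent fields from running together.
    return fnv1a_64("\x1f".join(parts).encode("utf-8"))


def is_duplicate(state, key_fields, stored_hashes):
    """Check a new state against existing hashes, storing the hash if it is new."""
    digest = metadata_hash(state, key_fields)
    if digest in stored_hashes:
        return True
    stored_hashes.add(digest)
    return False
```

In this sketch, `stored_hashes` is a Python set standing in for the content storage 316.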
A URI-based state list (such as that shown in FIG. 5) generated by the unguided crawler module 300 is consumed by the optimal path selection module 326. The optimal path selection module 326 selects a fastest URI for each state. For example, this may be the URI with the fewest number of hops (i.e., intents or user interface interactions). If two URIs have the same number of hops, the optimal path selection module 326 may use other criteria, such as preferring a URI that includes an intent, with the assumption that intents will be more consistent than user interface elements.
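For example only, the selection of a fastest URI may be sketched as follows. Each URI is assumed to be a tuple of (kind, target) steps, where kind is either `intent` or `ui`; this representation and the step targets shown are hypothetical.

```python
def select_fastest_uri(uris):
    """Pick the URI with the fewest hops, preferring one with an intent on a tie."""

    def rank(uri):
        has_intent = any(kind == "intent" for kind, _ in uri)
        # Fewer hops sort first; among equal hop counts, intents sort first.
        return (len(uri), 0 if has_intent else 1)

    return min(uris, key=rank)


uris = [
    (("ui", "image_235"),),                          # one touch on a home-state image
    (("ui", "Coffee and tea"), ("ui", "item_0")),    # two list selections
    (("intent", "RestaurantInfo?id=235"),),          # a single programming call
]
print(select_fastest_uri(uris))
# prints (('intent', 'RestaurantInfo?id=235'),)
```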
The optimal path selection module 326 provides the list of shortest URIs to a scraper 328. The URI list is used by the scraper 328 to reach each of the states of interest and extract their contents. The scraper 328 uses an emulator 332 to follow the shortest URI (which may be expressed as a breadcrumb file) for each of the states specified by the optimal path selection module 326. Within the emulator 332, the scraper 328 injects specified programming calls and replays user interface events as specified in the breadcrumb trail for each state of interest.
Upon arriving at the target state, the scraper 328 extracts text, images, and metadata from the state. This information is passed to a content parsing module 336. In other implementations, the data from the scraper 328 may be stored directly into a data warehouse or database.
The scraper 328 may be implemented as a scraping manager that concurrently runs multiple emulators including the emulator 332. Each of the emulators can then independently traverse paths to different states. The scraping manager therefore distributes the states of interest across the emulators and collates the scraped data. In various implementations, the scraping manager may interact with an application executing within one of the emulator instances using an actor similar to the link extractor actor 322, which interacts with the executing application using accessibility hooks/events. In various implementations, the emulator 332 may instead be a physical device, as described above.
The content parsing module 336 may identify content of interest from scraped states and map that data to specific fields to create app state records in the search data store 210. The content parsing module 336 may perform a combination of parsing, transformation, categorization, clustering, etc.
In FIG. 5, an example of a state list generated according to the method of FIG. 6 is shown. For each state, there may be multiple URIs, and in the example of FIG. 5 there is no state with only a single URI. The first state in the state list is for a state with information regarding a STARBUCKS store numbered 235. The unguided crawler will not necessarily recognize that the content of the state corresponds to a Starbucks store: the parenthetical has simply been added for illustration purposes.
The state (which a human would recognize as the restaurant info for Starbucks store #235) was reached from the home state in 3 different ways, resulting in 3 different URIs. The first is for a click or touch on an image (in the actual computer-readable state list, this may be a widget ID) from the homepage that happened to point to this listing. The second required two user interface interactions: one to select “Coffee and tea” from a list, and the next to select the first item from the resulting list of coffee/tea establishments. The third URI for this state involved a single programming call (intent) that specified the activity (or, app state template from which the restaurant info states are created) and an identifier of the state.
In FIG. 6, example operation of an unguided crawler begins at 404, where a home state of an app is added to a state table. At 408, the first state in the state table is selected. At 412, control navigates to the selected state using the most direct URI stored in the state table. At 416, control generates a list of URIs from the selected state. This list of URIs represents any other state that can be reached from the selected state. At 420, control selects a first URI from the generated URI list.
At 424, control determines whether the selected URI matches the URI of a state already in the state table. If so, control transfers to 428; otherwise, control transfers to 432. At 428, if the selected URI is the last URI in the list, control transfers to 436; otherwise, control transfers to 440. At 440, control selects the next URI from the URI list and transfers to 424.
At 432, control navigates to the selected URI, and at 444 determines whether an intent was fired when navigating to the selected URI. If so, control transfers to 448; otherwise, control transfers to 452. At 448, control records the fired intent as being parallel to the selected URI and continues at 452.
At 452, control determines whether the selected state (the state navigated to) matches the record of a state already in the state table. If so, control transfers to 456; otherwise, control transfers to 460. For example, control may compare a calculated hash value of content of the selected state to hash values already stored in the state table. For many hash algorithms, a close match of hash values does not imply that the hashed content is also a close match. As a result, the hash comparison may require an exact match. To compare the hash of the selected state to the stored hash values, a hash table of the hashes may be used (in which the hashes would be indexed by hashes of the hashes). In other implementations, a search (such as a binary search or quick search) may be performed of a sorted list of the hashes.
In addition to content comparison, control may also compare a user interface (UI) layout of the state to UI layouts stored in the state table. For example only, UI layout similarity may be used to confirm an apparent content match. If there is a textual content match between the selected state and an existing record of the state table (determined, for example, by a hash of the selected content of the selected state being equal to the hash of the existing record), a UI fingerprint of the selected state can be compared to a UI fingerprint of the existing record.
As described above, the UI fingerprint may include a width of the UI widget tree of a state, a depth (or height) of the UI widget tree of a state, a set of counts of predetermined types of widgets, etc. For example only, the UI fingerprint may even include a complete listing of all UI widgets of the state, along with their position within the tree. In various implementations, only when the content and the UI fingerprints both match will a match be declared between the selected state and the stored state.
The match between UI fingerprints may not require exact equality. Instead, a threshold percentage may be determined above which the UI fingerprints are considered a match. In some examples, certain elements of the UI fingerprints (such as, for example, depth and width) may require an exact match, while some divergence may be permitted among other elements.
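For example only, a UI fingerprint comparison that requires exact equality for depth and width but tolerates some divergence in widget counts may be sketched as follows. The fingerprint layout (a dictionary with `depth`, `width`, and `counts` entries) and the ratio used as the percentage similarity are assumptions made for illustration.

```python
def fingerprints_match(fp_a, fp_b, threshold=0.9):
    """Compare two UI fingerprints.

    Depth and width must match exactly; widget counts only need to
    overlap by at least `threshold` (a fraction between 0 and 1).
    """
    if fp_a["depth"] != fp_b["depth"] or fp_a["width"] != fp_b["width"]:
        return False
    widget_types = set(fp_a["counts"]) | set(fp_b["counts"])
    shared = sum(min(fp_a["counts"].get(t, 0), fp_b["counts"].get(t, 0))
                 for t in widget_types)
    total = sum(max(fp_a["counts"].get(t, 0), fp_b["counts"].get(t, 0))
                for t in widget_types)
    if total == 0:
        return True  # neither fingerprint recorded any widgets
    return shared / total >= threshold
```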
At 456, control adds the selected URI and, if recorded, a parallel intent to the matching state in the state table. The selected URI serves as an alternate path capable of reaching the matching state. Control then continues at 428. At 460, control adds the URI as a new entry in the state table and, if applicable, adds the parallel intent to the new entry in the state table. Control continues at 464 where, if the selected state is the last state in the state table, control transfers to 468; otherwise, control returns to 436.
At 468, control determines whether a re-crawl is desired. If so, control transfers to 408; otherwise, control remains at 468. The re-crawl may be initiated on a periodic basis or in response to a known change, such as additional states being added to the app or a new version of the app being released.
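For example only, the overall crawl may be approximated by the following simplified sketch. It does not reproduce the flowchart of FIG. 6 step by step: navigation in an emulator is replaced by a lookup in a hypothetical in-memory model of the app, intents are omitted, and a cyclic redundancy check of the content stands in for the content signature.

```python
import zlib
from collections import deque


def content_key(content):
    """Signature of a state's content (here, a CRC-32 of its text)."""
    return zlib.crc32(content.encode("utf-8"))


def crawl(app, home):
    """Crawl a modeled app and return {content signature: list of URIs}.

    app maps the content of a state to {action: content of the state reached}.
    """
    state_table = {content_key(home): [()]}  # the home state has the empty URI
    seen_uris = {()}
    queue = deque([((), home)])
    while queue:
        uri, content = queue.popleft()
        for action, target in app[content].items():
            new_uri = uri + (action,)
            if new_uri in seen_uris:
                continue  # this URI is already in the state table
            seen_uris.add(new_uri)
            key = content_key(target)
            if key in state_table:
                state_table[key].append(new_uri)  # alternate path to a known state
            else:
                state_table[key] = [new_uri]      # new state: record it and crawl it
                queue.append((new_uri, target))
    return state_table
```

Because only newly discovered states are queued, the sketch terminates even when states link back to each other.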
In an API comparison, the response of an API from a server can be used to identify new data and to discard duplicate data. The same method can also be used to incrementally scrape data to maintain freshness. For example, API calls for a certain state can be monitored, and then the API calls can be made directly to the server without having to visit the state in an emulator. A hash value may be calculated from the response returned to the API call. If the hash value matches a previously-recorded hash value for previously-received data, the new data can be discarded.
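For example only, discarding unchanged API responses may be sketched as follows. The class name and the use of a string to identify each API request are assumptions made for illustration. SHA-256 is used only because it is readily available in the Python standard library; as noted above, a cryptographically secure hash is not required.

```python
import hashlib


class FreshnessTracker:
    """Remember the hash of the last response seen for each API request."""

    def __init__(self):
        self._last_hash = {}

    def is_new(self, api_request, response_body: bytes) -> bool:
        """Return True if the response differs from the one previously recorded."""
        digest = hashlib.sha256(response_body).hexdigest()
        if self._last_hash.get(api_request) == digest:
            return False  # same data as before: discard
        self._last_hash[api_request] = digest
        return True
```

A caller would re-scrape or re-index a state only when `is_new` returns True for the corresponding API request.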

User Interface Comparison

Regarding UI pattern matching, the crawler can recognize different action UI elements, such as standard ANDROID operating system widgets. An example follows:


	<accesspath targetState='ActivityBusinessPage'>
		<action name='click'
			widgetId='com.YELP.android:id/nearby' text='Nearby' />
		<action name='select' widgetId='android:id/list'
			rowId='com.YELP.android:id/category_content'
			index='0' />
		<action name='select'
			widgetId='com.YELP.android:id/search_content'
			rowId='com.YELP.android:id/search_inner_layout'
			index='2' />
	</accesspath>
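For example only, an access path of this form may be read with a standard XML parser so that the actions can be replayed in order. The following sketch uses the Python standard library; the function name is hypothetical, and the replay itself is not shown.

```python
import xml.etree.ElementTree as ET

ACCESS_PATH = """
<accesspath targetState='ActivityBusinessPage'>
  <action name='click' widgetId='com.YELP.android:id/nearby' text='Nearby' />
  <action name='select' widgetId='android:id/list'
          rowId='com.YELP.android:id/category_content' index='0' />
  <action name='select' widgetId='com.YELP.android:id/search_content'
          rowId='com.YELP.android:id/search_inner_layout' index='2' />
</accesspath>
"""


def parse_access_path(xml_text):
    """Return the target state and the ordered list of action attribute dicts."""
    root = ET.fromstring(xml_text.strip())
    actions = [dict(action.attrib) for action in root.findall("action")]
    return root.get("targetState"), actions


target, actions = parse_access_path(ACCESS_PATH)
for step in actions:
    print(step["name"], step["widgetId"])
```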

States that have a different widget hierarchy (or, tree) are unlikely to be the same. Therefore, a fingerprint of the widget tree may be stored in the content storage 316. The fingerprint of the widget tree may include a count of each of various UI elements in the widget tree. To save storage space, the counts may be restricted to a limited number of the most common UI elements. In addition, the fingerprint may be based on a measure of depth or breadth of the widget tree.
The fingerprint may also be based on a count of the number of total UI elements located at each depth in the widget tree. The UI fingerprint of a new state can be compared against existing UI fingerprints, and if the similarity with an existing UI fingerprint is less than a threshold, the new state is considered not to be the same as that existing state.
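For example only, computing such a fingerprint from a widget tree may be sketched as follows. The tree is assumed to be nested dictionaries with hypothetical keys `class` and `children`, and `top_n` is an illustrative parameter that restricts the counts to the most common UI elements.

```python
from collections import Counter


def widget_tree_fingerprint(root, top_n=5):
    """Summarize a widget tree by element counts, depth, and breadth."""
    class_counts = Counter()
    per_depth = Counter()
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        class_counts[node["class"]] += 1
        per_depth[depth] += 1
        for child in node.get("children", []):
            stack.append((child, depth + 1))
    levels = [per_depth[d] for d in range(len(per_depth))]
    return {
        "counts": dict(class_counts.most_common(top_n)),  # most common UI elements only
        "depth": len(levels),                             # number of levels in the tree
        "width": max(levels),                             # largest number of elements on one level
        "per_depth": levels,                              # total UI elements at each depth
    }
```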
In FIG. 7, an example widget tree for an application state is shown. A root node 500-1 has child nodes 500-2, 500-3, 500-4, and 500-5. Each widget may be described using a class identification and a resource ID. When no resource ID is present, only the class identification is used. A widget may be uniquely identified by the resource ID, meaning that there are no widgets having different class identifications but with the same resource ID.
The resource IDs are simply shown in FIG. 7 as an integer but may be strings of characters and may even be human-readable. In this example, the resource IDs for widgets 500-2, 500-3, 500-4, and 500-5 are 1, 1, 3, and 4, respectively. Widget 500-2 has leaf nodes 500-6, 500-7, 500-8, and 500-9. The widgets 500-6, 500-7, and 500-8 each have a resource ID of 2. Meanwhile, widget 500-9 does not have a resource ID and is therefore identified by a widget class. In this simple example, the class identifier is an integer, but as shown above, the class identifier may instead be a string of characters that may indicate a type of widget (such as a checkbox).
Widget 500-4 includes leaf nodes 500-10, 500-11, 500-12, and 500-13. Widgets 500-10 and 500-11 have a resource ID of 2, and widget 500-12 has a resource ID of 5. Widget 500-13 does not have a resource ID and is only identified as widget class1. As an example, widget subtree 500-2 (which includes leaf nodes 500-6, 500-7, 500-8, and 500-9) displays one search result (such as a restaurant listing) while the widget subtree 500-4 (including leaf nodes 500-10, 500-11, 500-12, and 500-13) displays another search result.
Apps, such as ANDROID apps, may have layout definitions with unique ids (the unique id may also be referred to as a resource-id). During crawling, a combination of unique ids is sequenced to reach a particular state, which may be similar to XPath (XML Path Language).
The user interface of a state may be identified by its unique set of widget nodes. This unique set forms the pattern to look for in other instances of the same widget. For example only, in a HOTELS.COM accommodation research and booking application, the widgets can be listed with their resource-id and class attributes:


Widget group	Count
resid = “com.hcom.android:id/ser_res_p_card_current_photo”, class = “android.widget.ImageView”	1
resid = “com.hcom.android:id/ser_res_p_card_hotel_name”, class = “android.widget.TextView”	1
resid = “com.hcom.android:id/ser_res_p_card_price_discounted”, class = “android.widget.TextView”	1
resid = “com.hcom.android:id/ser_res_p_card_price_original”, class = “android.widget.TextView”	1
resid = “com.hcom.android:id/ser_res_p_qualitative_badge”, class = “android.widget.TextView”	1
resid = “com.hcom.android:id/ser_res_p_landmark_distance”, class = “android.widget.TextView”	1
resid = “com.hcom.android:id/ser_res_p_rooms_left”, class = “android.widget.TextView”	1
resid = “com.opentable:id/ratingbar”, class = “android.widget.RatingBar”	1
class = “android.view.View”	1
resid = “com.hcom.android:id/ser_res_p_card_wr_icon”, class = “android.view.View”	1

In various implementations, widgets from all levels of the UI tree may be combined together in a list. In other implementations, a widget pattern may be created for each level of the tree and compared with corresponding levels of existing UI trees.
The grouping in the table above forms an input widget pattern for an existing state. A new state would have a widget pattern similar to this input widget pattern in order for the new state to be considered a match to the existing state. The comparison may involve computing a percentage change of the count of each widget node in the input pattern and then taking the average of the percentage values:
Let A=input UI widget pattern

- $A = \{(g_{0} : c_{0}), (g_{1} : c_{1}), \ldots, (g_{n-1} : c_{n-1})\}$, where
- $n$ = number of widget groups
- $g_{i}$ = widget group i
- $c_{i}$ = number of widget nodes in input list matching widget group $g_{i}$ (the Count column in the above table)

Let B=candidate UI widget pattern

- $B = \{(g_{0} : C_{0}), (g_{1} : C_{1}), \ldots, (g_{n-1} : C_{n-1})\}$, where
- $g_{i}$ = widget group i
- $C_{i}$ = number of widget nodes in candidate list matching widget group $g_{i}$

Then, the percentage change of each widget group of the input widget pattern A compared to unidentified widget pattern B can be calculated as follows:

$p_{i} = \left| \frac{C_{i} - c_{i}}{c_{i}} \right|$
and the average percentage change between B and A will be
$P = \frac{\sum_{i = 0}^{n - 1} p_{i}}{n}$
The average percentage change P will be 0 if every widget group of the input widget pattern appears in the unidentified widget pattern with the same count. The average percentage change P can exceed 1.00 (100%) when the widget counts in the unidentified widget pattern are, on average, more than double the counts of the corresponding groups of the input widget pattern.
In the above numeric implementation of UI percentage comparison, only the widget groups present in A are relevant, while widget groups that appear only in B are ignored. This is consistent with the fact that a widget group not present in A would have a corresponding c_i of 0, leading to a divide-by-zero error.
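For example only, the average percentage change could be computed as in the following Python sketch, in which each widget pattern is a mapping from widget group to count. The function name average_percentage_change is hypothetical, and the sketch takes the magnitude of each per-group change so that increases and decreases do not cancel, which is an interpretive assumption.

```python
def average_percentage_change(input_pattern: dict, candidate_pattern: dict) -> float:
    """Average, over the groups of input pattern A, of |C_i - c_i| / c_i.

    Groups that appear only in the candidate pattern B are ignored, so every
    divisor c_i is a positive count taken from A.
    """
    if not input_pattern:
        raise ValueError("input pattern must contain at least one widget group")
    changes = []
    for group, c in input_pattern.items():
        big_c = candidate_pattern.get(group, 0)  # a group missing from B counts as 0
        changes.append(abs(big_c - c) / c)
    return sum(changes) / len(changes)


a = {"TextView": 2, "ImageView": 4}
same = average_percentage_change(a, {"TextView": 2, "ImageView": 4})
extra = average_percentage_change(a, {"TextView": 3, "ImageView": 4, "RatingBar": 9})
```

In this usage, same evaluates to 0.0 and extra evaluates to 0.25, because the RatingBar group is present only in the candidate pattern and does not contribute.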
In other implementations, widget groups appearing only in B may impact the percentage change score. For example, the percentage change may be subsequently adjusted based on widget groups present in B but not in A. In various implementations, the percentage change may be increased by an adjustment value. The adjustment value may be based on the number of widget nodes in pattern B that do not match any widget group from pattern A divided by the total number of widget nodes in pattern A. The adjustment value may be weighted (such as by a fraction less than one) before being added to the percentage change.
The average percentage change P can be compared with a threshold and, if less than the threshold, may be considered a match. For example only, the threshold may be set at 0.3. If the average percentage change P is less than the threshold, the new state is determined to be a duplicate of a state already visited rather than a unique state. In various implementations, widget groups may simply be individual UI widget classes.
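For example only, the adjustment for widget groups present only in pattern B and the threshold comparison could be sketched in Python as follows. The function names and the weight of 0.5 are hypothetical; the threshold of 0.3 is the example value given above.

```python
def adjusted_change(p: float, input_pattern: dict, candidate_pattern: dict,
                    weight: float = 0.5) -> float:
    """Increase P by a weighted share of candidate nodes that match no group in A."""
    unmatched = sum(count for group, count in candidate_pattern.items()
                    if group not in input_pattern)
    total_input = sum(input_pattern.values())
    return p + weight * unmatched / total_input


def is_match(score: float, threshold: float = 0.3) -> bool:
    """A score below the threshold means the new state matches the existing state."""
    return score < threshold


a = {"TextView": 2, "ImageView": 2}
b = {"TextView": 2, "ImageView": 2, "RatingBar": 2}
score = adjusted_change(0.0, a, b)             # 0.0 + 0.5 * 2 / 4
strict = adjusted_change(0.0, a, b, weight=1.0)  # 0.0 + 1.0 * 2 / 4
```

With the default weight, the two extra RatingBar nodes raise the score to 0.25, which is still treated as a match; with a weight of 1.0 the score becomes 0.5 and the states are treated as different.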
Returning to the example of FIG. 7, the widget pattern may be represented as:

{ResID1: 2, ResID2: 5, ResID3: 1, ResID4: 1, ResID5: 1, Class1: 2}

For example, if the state whose widget tree is shown in FIG. 7 is a candidate state, the widget list can be compared to widget lists of other states already stored in the state table.
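For example only, the widget pattern of FIG. 7 could be derived by walking the widget tree and counting nodes per widget group, as in the following Python sketch. The Widget type, the group_key helper, and the label format are hypothetical. The sketch assumes that the root node 500-1 is not counted, since the counts listed above total twelve, and that widget 500-9 belongs to class 1.

```python
from collections import Counter
from typing import NamedTuple, Optional, Tuple


class Widget(NamedTuple):
    """Hypothetical widget node; the class of a widget with a resource ID is not needed."""
    res_id: Optional[int]
    cls: Optional[int] = None
    children: Tuple["Widget", ...] = ()


def group_key(widget: Widget) -> str:
    """Identify a widget by its resource ID when present, otherwise by its class."""
    if widget.res_id is not None:
        return f"ResID{widget.res_id}"
    return f"Class{widget.cls}"


def widget_pattern(nodes) -> dict:
    """Count the widgets in each group across the given subtrees."""
    counts = Counter()
    stack = list(nodes)
    while stack:
        widget = stack.pop()
        counts[group_key(widget)] += 1
        stack.extend(widget.children)
    return dict(counts)


# Children of root node 500-1, as described for FIG. 7.
w_500_2 = Widget(1, None, (Widget(2), Widget(2), Widget(2), Widget(None, 1)))
w_500_3 = Widget(1)
w_500_4 = Widget(3, None, (Widget(2), Widget(2), Widget(5), Widget(None, 1)))
w_500_5 = Widget(4)

fig7_pattern = widget_pattern([w_500_2, w_500_3, w_500_4, w_500_5])
```

Under these assumptions, fig7_pattern contains the same counts as the pattern listed above.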
The widget pattern may also be annotated with the height of the tree, which in FIG. 7 may be described as two or as three, depending on whether the root node is included in the height. The width of the tree may be determined by counting the number of nodes at each level of the tree and selecting the largest count as the width. The width of the tree may alternatively be determined by counting the number of leaf nodes in the widget tree.
The height and width of the tree may be used as matching litmus tests. For example, if the height of a new state doesn't match the height of an existing state, the new state may be presumptively considered as not a match for the existing state. As another example, if the width of a new state differs by more than 10% from the width of an existing state, the new state may be presumptively considered as not a match for the existing state.
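For example only, the height, the two width measures, and the litmus tests could be sketched in Python as follows. Each node is represented simply as a list of its children, and all function names are hypothetical. The sketch assumes that the 10% tolerance is measured relative to the width of the existing state and that neither child node 500-3 nor child node 500-5 of FIG. 7 has children.

```python
def level_sizes(root: list) -> list:
    """Number of nodes at each level of the tree, starting with the root level."""
    sizes = []
    level = [root]
    while level:
        sizes.append(len(level))
        level = [child for node in level for child in node]
    return sizes


def height(root: list, count_root: bool = True) -> int:
    """Number of levels, optionally excluding the root level."""
    levels = len(level_sizes(root))
    return levels if count_root else levels - 1


def width_by_level(root: list) -> int:
    """Largest number of nodes found at any single level."""
    return max(level_sizes(root))


def width_by_leaves(root: list) -> int:
    """Number of leaf nodes in the tree."""
    if not root:
        return 1
    return sum(width_by_leaves(child) for child in root)


def passes_litmus(new_height: int, new_width: int, old_height: int, old_width: int,
                  tolerance: float = 0.10) -> bool:
    """Heights must be equal and widths must differ by no more than the tolerance."""
    if new_height != old_height:
        return False
    return abs(new_width - old_width) <= tolerance * old_width


# Shape of the FIG. 7 tree: four children, two of which have four leaves each.
fig7 = [[[], [], [], []], [], [[], [], [], []], []]
fig7_levels = level_sizes(fig7)         # [1, 4, 8]
fig7_height = height(fig7)              # 3 with the root, 2 without
fig7_width = width_by_level(fig7)       # 8
fig7_leaves = width_by_leaves(fig7)     # 10
```

A state that fails passes_litmus can be presumptively treated as not matching the existing state without running the more detailed widget pattern comparison.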
In various implementations, the widget pattern (specified by counts of specific types, or groups, of widgets) of each state may be converted to a reduced form, such as by calculating a hash, as described above. If a hash of the widget pattern of a new state matches one of the stored hashes (corresponding to existing states), the height and width of the new state may be compared to stored values for height and width corresponding to the matching stored state. The height and width may therefore serve as verification of a hash match.
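For example only, hashing a widget pattern and then verifying a hash match using height and width could be sketched in Python as follows. The function names, the use of SHA-256 over a sorted JSON encoding, the storage of one height and width pair per hash, and the requirement that height and width match exactly are all assumptions made for illustration.

```python
import hashlib
import json


def pattern_hash(pattern: dict) -> str:
    """Hash a widget pattern in a canonical order so equal patterns hash equally."""
    canonical = json.dumps(sorted(pattern.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def matches_stored_state(pattern: dict, height: int, width: int, stored: dict) -> bool:
    """stored maps a pattern hash to the (height, width) of the existing state."""
    dims = stored.get(pattern_hash(pattern))
    if dims is None:
        return False                  # no stored hash matches the new state
    return dims == (height, width)    # height and width verify the hash match


existing = {"ResID1": 2, "ResID2": 5, "ResID3": 1, "ResID4": 1, "ResID5": 1, "Class1": 2}
stored = {pattern_hash(existing): (3, 8)}

# The same counts listed in a different order still produce the same hash.
reordered = {"Class1": 2, "ResID5": 1, "ResID4": 1, "ResID3": 1, "ResID2": 5, "ResID1": 2}
verified = matches_stored_state(reordered, 3, 8, stored)
rejected = matches_stored_state(reordered, 2, 8, stored)
```

Because a hash only detects identical patterns, this reduced form trades the tolerance of the percentage comparison for a faster lookup.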

Overall

The foregoing description is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses. The broad teachings of the disclosure can be implemented in a variety of forms. Therefore, while this disclosure includes particular examples, the true scope of the disclosure should not be so limited since other modifications will become apparent upon a study of the drawings, the specification, and the following claims. It should be understood that one or more steps within a method may be executed in different order (or concurrently) without altering the principles of the present disclosure. Further, although each of the embodiments is described above as having certain features, any one or more of those features described with respect to any embodiment of the disclosure can be implemented in and/or combined with features of any of the other embodiments, even if that combination is not explicitly described. In other words, the described embodiments are not mutually exclusive, and permutations of one or more embodiments with one another remain within the scope of this disclosure.
Spatial and functional relationships between elements (for example, between modules) are described using various terms, including “connected,” “engaged,” “interfaced,” and “coupled.” Unless explicitly described as being “direct,” when a relationship between first and second elements is described in the above disclosure, that relationship encompasses a direct relationship where no other intervening elements are present between the first and second elements, and also an indirect relationship where one or more intervening elements are present (either spatially or functionally) between the first and second elements. As used herein, the phrase at least one of A, B, and C should be construed to mean a logical (A OR B OR C), using a non-exclusive logical OR, and should not be construed to mean “at least one of A, at least one of B, and at least one of C.”
In this application, including the definitions below, the term ‘module’ or the term ‘controller’ may be replaced with the term ‘circuit.’ The term ‘module’ may refer to, be part of, or include processor hardware (shared, dedicated, or group) that executes code and memory hardware (shared, dedicated, or group) that stores code executed by the processor hardware.
The module may include one or more interface circuits. In some examples, the interface circuits may include wired or wireless interfaces that are connected to a local area network (LAN), the Internet, a wide area network (WAN), or combinations thereof. The functionality of any given module of the present disclosure may be distributed among multiple modules that are connected via interface circuits. For example, multiple modules may allow load balancing. In a further example, a server (also known as remote, or cloud) module may accomplish some functionality on behalf of a client module.
The term code, as used above, may include software, firmware, and/or microcode, and may refer to programs, routines, functions, classes, data structures, and/or objects. Shared processor hardware encompasses a single microprocessor that executes some or all code from multiple modules. Group processor hardware encompasses a microprocessor that, in combination with additional microprocessors, executes some or all code from one or more modules. References to multiple microprocessors encompass multiple microprocessors on discrete dies, multiple microprocessors on a single die, multiple cores of a single microprocessor, multiple threads of a single microprocessor, or a combination of the above.
Shared memory hardware encompasses a single memory device that stores some or all code from multiple modules. Group memory hardware encompasses a memory device that, in combination with other memory devices, stores some or all code from one or more modules.
The term memory hardware is a subset of the term computer-readable medium. The term computer-readable medium, as used herein, does not encompass transitory electrical or electromagnetic signals propagating through a medium (such as on a carrier wave); the term computer-readable medium is therefore considered tangible and non-transitory. Non-limiting examples of a non-transitory computer-readable medium are nonvolatile memory devices (such as a flash memory device, an erasable programmable read-only memory device, or a mask read-only memory device), volatile memory devices (such as a static random access memory device or a dynamic random access memory device), magnetic storage media (such as an analog or digital magnetic tape or a hard disk drive), and optical storage media (such as a CD, a DVD, or a Blu-ray Disc).
The apparatuses and methods described in this application may be partially or fully implemented by a special purpose computer created by configuring a general purpose computer to execute one or more particular functions embodied in computer programs. The functional blocks and flowchart elements described above serve as software specifications, which can be translated into the computer programs by the routine work of a skilled technician or programmer.
The computer programs include processor-executable instructions that are stored on at least one non-transitory computer-readable medium. The computer programs may also include or rely on stored data. The computer programs may encompass a basic input/output system (BIOS) that interacts with hardware of the special purpose computer, device drivers that interact with particular devices of the special purpose computer, one or more operating systems, user applications, background services, background applications, etc.
The computer programs may include: (i) descriptive text to be parsed, such as HTML (hypertext markup language) or XML (extensible markup language), (ii) assembly code, (iii) object code generated from source code by a compiler, (iv) source code for execution by an interpreter, (v) source code for compilation and execution by a just-in-time compiler, etc. As examples only, source code may be written using syntax from languages including C, C++, C#, Objective-C, Haskell, Go, SQL, R, Lisp, Java®, Fortran, Perl, Pascal, Curl, OCaml, Javascript®, HTML5, Ada, ASP (active server pages), PHP, Scala, Eiffel, Smalltalk, Erlang, Ruby, Flash®, Visual Basic®, Lua, and Python®.
None of the elements recited in the claims are intended to be a means-plus-function element within the meaning of 35 U.S.C. §112(f) unless an element is expressly recited using the phrase “means for” or, in the case of a method claim, using the phrases “operation for” or “step for.”

Claims

What is claimed is:

1. A system for automated acquisition of content from an application, the system comprising:

a state storage module configured to store application state records, wherein each record of the application state records includes a representation of content of a corresponding application state of the application;

a link tracking module configured to control an executing instance of the application to navigate to a first application state of the application;

a duplicate content detector configured to (i) calculate a representation of content of the first application state, (ii) compare the calculated representation to the stored representations of content in the application state records, and (iii) generate a comparison signal based on the comparison, wherein:

the comparison signal indicates whether a match is found between the calculated representation and any of the stored representations of content in the application state records,

the link tracking module is configured to create a new application state record in the state storage module only in response to the comparison signal indicating that no match is found, and

the calculated representation is stored in the new application state record; and

a scraper module configured to, for each of the application state records in the state storage module, extract text and metadata from the corresponding application state, wherein information based on the extracted text and metadata is stored in a data store.

2. The system of claim 1 wherein:

each record of the application state records includes a unique identifier that uniquely identifies the corresponding application state among the application state records of the state storage module; and

the duplicate content detector is configured to calculate the representation of content of the first application state only in response to a unique identifier of the first application state not matching any of the stored unique identifiers in the application state records of the state storage module.

3. The system of claim 2 wherein the link tracking module is configured to:

in response to the comparison signal indicating that the match is found between the calculated representation and a stored representation of a second record of the application state records, add the unique identifier of the first application state to the second record; and

in response to the comparison signal indicating that no match is found, store the unique identifier of the first application state in the new application state record.

4. The system of claim 2 wherein:

for each record of the application state records, the unique identifier identifies a path followed within the executing application instance from a default state of the application to the corresponding application state; and

the unique identifier of the first application state indicates a path followed within the executing application instance from the default state of the application to the first application state.

5. The system of claim 4 wherein, for each record of the application state records, the path includes user interface interactions performed within the executing application instance in order to navigate from the default state of the application to the corresponding application state.

6. The system of claim 1 wherein:

each record of the application state records includes a user interface (UI) fingerprint of the corresponding application state; and

the duplicate content detector is configured to, in response to the calculated representation matching the stored representation of content of a second application state record:

determine a UI fingerprint of the first application state;

compare the UI fingerprint of the first application state to the stored UI fingerprint of the second application state record; and

generate the comparison signal indicating that no match is found in response to the UI fingerprint of the first application state not matching the UI fingerprint of the second application state record.

7. The system of claim 6 wherein, for each record of the application state records, the UI fingerprint includes at least one of (i) a width of a UI widget tree extracted from the corresponding application state and (ii) a depth of the UI widget tree extracted from the corresponding application state.

8. The system of claim 6 wherein, for each record of the application state records, the UI fingerprint includes a list of counts for predetermined UI widget types, each of the counts representing a number of occurrences of the predetermined UI widget type in the corresponding application state.

9. The system of claim 8 wherein the duplicate content detector is configured to generate the comparison signal indicating that no match is found in response to a percentage difference between the UI fingerprint of the first application state and the UI fingerprint of the second application state record being greater than a predetermined threshold.

10. The system of claim 1 wherein the duplicate content detector is configured to calculate the representation of the content of the first application state by calculating a hash value of the content of the first application state.

11. The system of claim 10 wherein the content of the first application state includes text displayed within the first application state.

12. The system of claim 1 wherein the link tracking module is configured to execute the instance of the application within an emulator.

13. The system of claim 1 wherein the link tracking module is configured to:

identify a set of application states reachable from the first application state, wherein each of the set of application states is reachable via a respective user interface interaction with the first application state; and

control the executing instance of the application to navigate to each of the set of application states in turn.

14. The system of claim 1 wherein:

the link tracking module is configured to identify a request made by the application to a remote server in order for the application to render the first application state; and

the duplicate content detector is configured to generate the comparison signal indicating a match between the first application state and a second application state record in response to parameters of the identified request matching stored request parameters in the second application state record.

15. A search system comprising:

the system of claim 1;

the data store;

a set generation module configured to, in response to a query from a user device, select records from the data store to form a consideration set of records;

a set processing module configured to assign a score to each record of the consideration set of records; and

a results generation module configured to respond to the user device with a subset of the consideration set of records, wherein the subset is selected based on the assigned scores, and wherein the subset identifies application states of applications that are relevant to the query.

16. A method for automated acquisition of content from an application, the method comprising:

storing application state records, wherein each record of the application state records includes a representation of content of a corresponding application state of the application;

controlling an executing instance of the application to navigate to a first application state of the application;

calculating a representation of content of the first application state;

comparing the calculated representation to the stored representations of content in the application state records;

generating a comparison signal based on the comparison, wherein the comparison signal indicates whether a match is found between the calculated representation and any of the stored representations of content in the application state records;

only in response to the comparison signal indicating that no match is found, creating and storing a new application state record for the first application state, wherein the calculated representation is stored in the new application state record; and

for each of the application state records, extracting text and metadata from the corresponding application state, wherein information based on the extracted text and metadata is stored in a data store.

17. The method of claim 16 wherein:

each record of the application state records includes a unique identifier that uniquely identifies the corresponding application state among the application state records; and

the calculating the representation of content of the first application state is performed only in response to a unique identifier of the first application state not matching any of the stored unique identifiers in the application state records.

18. The method of claim 17 further comprising:

in response to the comparison signal indicating that the match is found between the calculated representation and a stored representation of a second record of the application state records, adding the unique identifier of the first application state to the second record; and

in response to the comparison signal indicating that no match is found, storing the unique identifier of the first application state in the new application state record.

19. The method of claim 17 wherein:

20. The method of claim 19 wherein, for each record of the application state records, the path includes user interface interactions performed within the executing application instance in order to navigate from the default state of the application to the corresponding application state.

21. The method of claim 16 wherein:

the method further comprises, in response to the calculated representation matching the stored representation of content of a second application state record:

determining a UI fingerprint of the first application state;

comparing the UI fingerprint of the first application state to the stored UI fingerprint of the second application state record; and

generating the comparison signal indicating that no match is found in response to the UI fingerprint of the first application state not matching the UI fingerprint of the second application state record.

22. The method of claim 21 wherein, for each record of the application state records, the UI fingerprint includes at least one of (i) a width of a UI widget tree extracted from the corresponding application state and (ii) a depth of the UI widget tree extracted from the corresponding application state.

23. The method of claim 21 wherein, for each record of the application state records, the UI fingerprint includes a list of counts for predetermined UI widget types, each of the counts representing a number of occurrences of the predetermined UI widget type in the corresponding application state.

24. The method of claim 23 further comprising generating the comparison signal indicating that no match is found in response to a percentage difference between the UI fingerprint of the first application state and the UI fingerprint of the second application state record being greater than a predetermined threshold.

25. The method of claim 16 further comprising calculating the representation of the content of the first application state by calculating a hash value of the content of the first application state.

26. The method of claim 25 wherein the content of the first application state includes text displayed within the first application state.

27. The method of claim 16 further comprising:

identifying a set of application states reachable from the first application state, wherein each of the set of application states is reachable via a respective user interface interaction with the first application state; and

controlling the executing instance of the application to navigate to each of the set of application states in turn.

28. The method of claim 16 further comprising:

identifying a request made by the application to a remote server in order for the application to render the first application state; and

generating the comparison signal indicating a match between the first application state and a second application state record in response to parameters of the identified request matching stored request parameters in the second application state record.