WO2008134172A1 - Web spam page classification using query-dependent data - Google Patents

Web spam page classification using query-dependent data

Info

Publication number
WO2008134172A1
Authority
WO
WIPO (PCT)
Prior art keywords
web
search query
pages
features
spam
Prior art date
Application number
PCT/US2008/058637
Other languages
French (fr)
Inventor
Krysta Svore
Chris Burges
Original Assignee
Microsoft Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Microsoft Corporation
Publication of WO2008134172A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Definitions

  • a search engine retrieves pages relevant to a user's query by comparing the attributes of pages, together with other features such as anchor text, and returns those that best match the query. The user is then typically shown a ranked list of Universal Resource Locators (URLs) of between 10 and 20 per page.
  • Some commercial companies, to increase their website traffic, hire search engine optimization (SEO) companies to improve their site's ranking.
  • There are numerous ways to improve a site's ranking which may be broadly categorized as white-hat and gray-hat (or black-hat) SEO techniques.
  • White-hat SEO methods focus on improving the quality and content of a page so that the information on the page is useful for many users.
  • Such a method for improving a site's rank may be to improve the content of the site, such that it appears most relevant for those queries one would like to target.
  • Gray-hat and black-hat SEO techniques include methods such as link stuffing, keyword stuffing, cloaking, web farming, and so on.
  • Link stuffing is the practice of creating many pages that have little content or duplicated content, all of which link to a single optimized target page. Ranking based on link structure can be fooled by link-stuffed pages into treating the target page as a better page since so many pages link to it. Keyword stuffing is when a page is filled with query terms, making it appear very relevant to a search whose query contains one or more of those terms, even though the actual relevance of the page may be low. A keyword-stuffed page will rank higher since it appears to have content relevant to the query. For over-loaded terms on the page, the page will appear higher in the search results and will ultimately draw users to click on the site.
  • White-hat techniques can result in a better web search experience for a user.
  • the user is provided with satisfactory results based on the original intent of the search query.
  • gray-hat or black-hat techniques can derail the user's search process in the hope of persuading the user to buy something they were not originally looking for.
  • a web spam page classifier that identifies web spam pages based on features of a search query and web page pair.
  • the web page is identified by a URL.
  • the features can be identified based on spamming techniques that detract from a search experience for a user.
  • the features can be extracted from training instances and a training algorithm can be employed to develop the classifier.
  • pages identified as web spam pages can be demoted and/or removed from a relevancy ranked list.
  • FIG. 1 is a block diagram of a search engine system.
  • FIG. 2 is a flow diagram of a method for rendering search results to a user based on a query.
  • FIG. 3 is a block diagram of a classifier training system.
  • FIG. 4 is a flow diagram of a method for training a classifier.
  • FIG. 5 is a block diagram of a general computing environment.
  • FIG. 1 is a block diagram of a search engine system 100 for receiving a search query and rendering a list of results based on the search query.
  • system 100 receives a search query 102 from a user.
  • the search query includes one or more terms for which the user desires to acquire further information.
  • the user operates a computing device such as a mobile device or personal computer to enter the search query 102.
  • the search query 102 is sent to a search engine module 104 through a suitable means of communication.
  • Search engine module 104 processes search query 102 to extract and/or identify features or attributes of the query and provide these features to a ranking module 106 and a web spam classification module 108.
  • ranking module 106 and web spam classification module 108 can be combined into a single module as desired.
  • Ranking module 106 accesses an index 110 that contains information related to a plurality of web pages. Web pages are identified uniquely by a URL. The URL includes a particular domain as well as a path to the particular page. There are several known methods for creating an index of web pages that can be accessed based on a search query. Among other things, index 110 can catalog words, titles, subtitles, metatags and/or any other information related to a particular web page and its corresponding domain. Based on a comparison of search query 102 to index 110, ranking module 106 outputs a ranked list 112 of web pages ranked by relevance to the query. Ranked list 112 can be rendered in a web page where the top results (e.g., the top 10 or 20 results) are rendered with hyperlinks that link to web pages.
  • Ranking module 106 can be driven by a ranking algorithm that identifies features of the search query 102, web pages under consideration and relationships of terms in the search query with content of the web pages.
  • Example features include a most frequent term in the web page, a number of times a particular term appears in the web page, a domain name associated with the web page (e.g., www.example.com), a number of links pointing to the page, whether a query term appears in a title of the web page, etc.
  • the ranking algorithm can perform a calculation as to the relevancy value of a page given a query. This calculation can be performed as a function of feature vectors.
  • ranked list 112 may include one or more web spam pages. These web spam pages are designed to increase their ranking by ranking module 106 without increasing the quality or content of the page. These web spam pages are of limited value to a user and simply do not offer content of as high a quality as an actually relevant page. Elimination of web spam pages from search results can be important for several reasons. A user may choose to use a different search engine if web spam pages receive high rankings, and other legitimate sites may utilize spamming techniques to improve ratings. Ultimately, web spam pages negatively affect user experience, rankings of pages and processing costs.
  • system 100 utilizes web spam classification module 108 to identify a list of web spam pages 114 as a function of the search query 102 and a comparison to index 110.
  • web spam classification module 108 can identify spam pages directly from ranked list 112. Lists 112 and 114 are combined to provide an updated list 116 that can be provided to the user. Updated list 116 can demote web spam pages to a lower position in list 112 and/or eliminate web spam pages from the list.
  • Web spam classification module 108 can be driven by a classifier designed to label a given web page as spam or non-spam.
  • the classifier can perform a calculation based on the same or similar features utilized by the ranking algorithm associated with ranking module 106. This use of features can be advantageous since features for the query and pages are already extracted for operation of ranking module 106. Furthermore, by training a classifier on these features to locate spam web pages, outliers of a distribution for highly ranked pages can be an indication of web spam. For example, a web page that stuffs numerous keywords into its page for the purpose of increasing its relevancy ranking will be likely to have more keywords than a legitimate site for a particular keyword. By training the classifier to recognize these situations, web spam pages can be more accurately identified.
  • FIG. 2 is a method 200 performed by system 100 to provide list 116 to the user.
  • the search query 102 is accessed by search engine module 104.
  • In Internet-based search engines, a user is located remotely from the search engine module and enters the search query into a suitable web browser.
  • query terms from search query 102 are compared to index 110 and the ranked list of relevant pages is provided at step 206.
  • Web spam pages are then identified at step 208 using web spam classification module 108 and an updated ranked list is provided at step 210. If one or more web spam pages are identified they can be removed from the ranked list or demoted to a lower ranking in the list, for example, by 10, 20, 25 or more positions.
  • FIG. 3 is a block diagram of a system 300 for training a classifier of web spam classification module 108.
  • System 300 includes a training module 302 that receives input in the form of training instances 304 and a feature model 306. Based on the training instances 304 and feature model 306, a classifier 308 is output.
  • Feature model 306 includes spam based features 310, rank-time query independent features 312 and rank-time query dependent features 314.
  • classifier 308 labels a given web page as spam or not spam.
  • the training module utilizes training and testing data that is composed of a number of labeled instances 304, or samples, where each sample has a vector of attributes, or features. In one example, the labels are determined by human judges. Classification involves producing a feature model 306 during training to predict the label of each instance in the test set given only the vector of feature values.
  • training instances 304 are used to determine the parameters of classifier 308.
  • the classifier 308 examines a vector of features jointly to determine, based on the feature values, if a web page is spam or not.
  • the classifier 308 is evaluated during testing by comparing the label given by the classifier with the instance's assigned label.
  • the classifier 308 can be trained using a suitable learning method.
  • One example is a support vector machine (SVM).
  • An SVM produces a linear separating hyperplane between the feature vectors for two class labels (i.e., spam and non-spam) in a transformed version of the feature space. Instances are then classified based on where they lie in the transformed version of the feature space.
  • the SVM finds the separating hyperplane with maximal margin in the high-dimensional space.
  • Classifier 308 is based on page-level, content-based classification, as opposed to host-level classification or link-level classification. It is worth noting that classifier 308 could be used in conjunction with a domain-level or link-level classifier, by using classifier 308 at rank time and another classifier at index-generation time, for example. Features based on domain and link information, however, may be used in the classifier.
  • Training instances 304 include a large set of human-labeled (query, URL) pairs, although other approaches for obtaining training instances 304 can be used. In one instance, queries can be determined by looking at search engine query logs for a search engine such as Microsoft Live Search (available at www.live.com) provided by Microsoft Corporation of Redmond, Washington.
  • the queries can be sampled such that the set of queries represent a realistic distribution of queries that users would submit to a search engine.
  • the query frequencies are determined from toolbar data as well as query logs. Queries include commercial queries, spam queries, and non-commercial queries.
  • a human judge is given the list of queries and issues each query to a search engine. A returned list of 10 results with snippets is shown to the judge. For each URL appearing in the top 10 returned search results, the judge labels the URL as spam, not spam, or unknown. The judgment is made based on the quality of content, the use of obvious spam techniques, and whether or not the result should appear in the top 10.
  • Spam features 310 include page-level attributes.
  • the attributes include domain-level features, page-level features, and link information. Values of these features can be determined by mining feature information for each URL in the testing and training sets. Examples of such features include the number of spammy in-links, the top level domain of the site, quality of phrases in the document and density of keywords (spammy terms). The number of spammy in-links is the number of in-links coming from labeled spam pages. The quality of phrases in the document is a score that indicates the quality of terms on the page.
  • the density of keywords is a score that indicates how many terms on the page are spam terms.
  • feature model 306 includes rank-time features.
  • Rank-time features are features extracted for use in the ranking algorithm.
  • a large number of web spam pages appear in ranked search results.
  • In order to receive a high rank, web spam pages must contain content that "fools" algorithms used for populating the index and for ranking search results. These algorithms take feature vectors as input, where the feature vectors have a specific distribution over the feature set. The distribution is difficult for a spammer to match without knowledge of how the crawling, indexing, and ranking algorithms work.
  • Although ranking module 106 believes the web spam page to be highly relevant, classifier 308, with the same feature data as input, but trained on spam labels, should be able to easily identify web spam pages, since they will be outliers of the distribution.
  • Ranking module 106 is trained to solve a different problem, namely the ordering of relevant pages, not the identification of web spam pages. By using a separate classifier trained to catch web spam, web spam can be demoted and/or removed in the ranked results.
  • Rank-time features can be separated into query-independent features 312 and query-dependent features 314.
  • the query-independent rank-time features 312 can be grouped into page-level features, domain-level features, anchor features, popularity features, and time features.
  • Page-level features are features that can be determined by looking just at a page or URL. Examples of page-level features include static rank features, the count of the most frequent term, the count of the number of unique terms, the total number of terms, the number of words in the path, and the number of words in the title.
  • Domain-level features are computed as averages across all pages in a domain while other modes of calculation can be used. Examples of domain-level features include the rank of the domain, the average number of words, and the top-level domain.
  • Popularity features are features that measure the popularity of pages through user data. Popularity features can be derived from toolbar data, where the user has agreed to provide access to data collected during a logged session.
  • the popularity features can include domain- and page-level features.
  • Example features are the number of hits within a domain, the number of users of a domain, the number of hits on a URL, and the number of users of a URL.
  • Time features include the date the URL was crawled, the last date the page changed, and the time since the page was crawled. Other features, such as frequent term counts, anchor text features, etc. can also be used. Exemplary rank-time query-independent features are listed in Table 1.
  • Table 1. Rank-time query-independent features.
      Page level: static rank, most frequent term, number of unique terms, total number of terms, number of words in path, number of words in title
      Domain level: domain rank, average number of words, top-level domain
  • Query-dependent features 314 are content features that relate to the one or more terms in search query 102. This feature set can include several hundred query-dependent features.
  • Query-dependent features are generated from the query, document content, and URL.
  • Query-dependent features can depend just on the query or on the match between query and document properties. Examples of query-dependent features include the number of query terms in the title and the frequency of a query term on the page, as well as various counts of the occurrences of the query term across all documents, the number of documents that contain the query term, and n-grams over the query terms and the document.
  • Table 2 lists several example query dependent features that can be used in classifier 308.
  • Although query-dependent features are utilized in classifier 308, a spam label is not necessarily query-dependent. Currently, however, classifier 308 is utilized such that a query is issued and then each returned page is examined to determine if it is spam or not. Instances where different queries yield different results as to whether a page is spam or not can be corrected if desired.
  • FIG. 4 is a method 400 for outputting classifier 308 that can be used by web spam classification module 108.
  • labeled training instances including a query and a URL are obtained. These instances include both non-spam web pages and spam web pages.
  • a feature based model is obtained at step 404 to identify features of web pages as well as features of relationships between queries and web pages.
  • the classifier is trained to identify web spam pages given a query. Furthermore, at step 408, the classifier is output.
  • rank-time content can be effective in classifying web spam pages. Since one goal of spammers is to generate data to fool search engine ranking algorithms, tell-tale properties of data in web spam pages can be detected by secondary classifiers.
  • The computing environment shown in FIG. 5 is one such example that can be used to implement a portion or all of these systems and/or methods.
  • the computing system environment 500 is only one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the claimed subject matter. Neither should the computing environment 500 be interpreted as having any dependency or requirement relating to any one or combination of components illustrated in the exemplary computing environment 500.
  • Computing environment 500 illustrates a general purpose computing system environment or configuration.
  • Examples of well-known computing systems, environments, and/or configurations that may be suitable for use with the service agent or a client device include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, telephony systems, distributed computing environments that include any of the above systems or devices, and the like.
  • Exemplary environment 500 for implementing the above embodiments includes a general-purpose computing system or device in the form of a computer 510.
  • Components of computer 510 may include, but are not limited to, a processing unit 520, a system memory 530, and a system bus 521 that couples various system components including the system memory to the processing unit 520.
  • the system bus 521 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures.
  • bus architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus also known as Mezzanine bus.
  • Computer 510 typically includes a variety of computer readable media.
  • Computer readable media can be any available media that can be accessed by computer 510 and includes both volatile and nonvolatile media, removable and non-removable media.
  • computer readable media may comprise computer storage media and communication media.
  • Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data.
  • the system memory 530 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 531 and random access memory (RAM) 532.
  • the computer 510 may also include other removable/non-removable volatile/nonvolatile computer storage media.
  • Non-removable non-volatile storage media are typically connected to the system bus 521 through a non-removable memory interface such as interface 540.
  • Removable non-volatile storage media are typically connected to the system bus 521 by a removable memory interface, such as interface 550.
  • a user may enter commands and information into the computer 510 through input devices such as a keyboard 562, a microphone 563, a pointing device 561, such as a mouse, trackball or touch pad, and a video camera 564.
  • These and other input devices are often connected to the processing unit 520 through a user input interface 560 that is coupled to the system bus, but may be connected by other interface and bus structures, such as a parallel port or a universal serial bus (USB).
  • a monitor 591 or other type of display device is also connected to the system bus 521 via an interface, such as a video interface 590.
  • computer 510 may also include other peripheral output devices such as speakers 597, which may be connected through an output peripheral interface 595.
  • When used in a LAN networking environment, the computer 510 is connected to the LAN 571 through a network interface or adapter 570. When used in a WAN networking environment, the computer 510 typically includes a modem 572 or other means for establishing communications over the WAN 573, such as the Internet.
  • The modem 572, which may be internal or external, may be connected to the system bus 521 via the user input interface 560, or other appropriate mechanism.
  • program modules depicted relative to the computer 510, or portions thereof, may be stored in the remote memory storage device.
  • FIG. 5 illustrates remote application programs 585 as residing on remote computer 580. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between computers may be used.
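Looking back at the training flow described earlier in this section (labeled (query, URL) instances, a feature model, and a learner such as an SVM), the following is a minimal sketch of how a classifier could be fit, assuming each instance has already been reduced to a numeric feature vector. It uses a plain linear model trained by subgradient descent on a regularized hinge loss, which is a simplified stand-in for an SVM solver and not the method the application specifies. The function names, hyperparameters, and the two toy features are hypothetical.

```python
def train_linear_classifier(instances, epochs=200, lr=0.1, reg=0.01):
    """Fit weights w and bias b by subgradient descent on hinge loss.

    instances: list of (feature_vector, label), label +1 (spam) or -1.
    epochs, lr, reg: hypothetical hyperparameters for this sketch.
    """
    dim = len(instances[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in instances:
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # L2 regularization shrinks the weights on every step.
            w = [wi * (1 - lr * reg) for wi in w]
            if margin < 1:  # inside the margin or misclassified
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def classify(w, b, x):
    """Label a feature vector: +1 for spam, -1 for non-spam."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score > 0 else -1


# Toy instances: [query-term density, share of spammy in-links] (made up).
training = [
    ([0.9, 0.8], 1), ([0.8, 0.9], 1),    # labeled spam
    ([0.1, 0.2], -1), ([0.2, 0.1], -1),  # labeled non-spam
]
w, b = train_linear_classifier(training)
```

In practice the labels would come from human judges and the vectors from the rank-time feature extraction, and a kernelized SVM library could replace the hand-written loop.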

Abstract

A web spam page classifier is described that identifies web spam pages based on features of a search query and web page pair. The features can be extracted from training instances and a training algorithm can be employed to develop the classifier. Pages identified as web spam pages can be demoted and/or removed from a relevancy ranked list.

Description

WEB SPAM PAGE CLASSIFICATION USING QUERY-DEPENDENT DATA
BACKGROUND
[0001] As the amount of information on the World Wide Web grows, the use of search engines to find relevant information becomes ever more critical. A search engine retrieves pages relevant to a user's query by comparing the attributes of pages, together with other features such as anchor text, and returns those that best match the query. The user is then typically shown a ranked list of Universal Resource Locators (URLs) of between 10 and 20 per page. The ranking of pages by search engines has proved to be a crucial component of how users browse the web. This situation arises not only from the simple gathering of information for users, but also from commercial transactions that result from the search activity.
[0002] Some commercial companies, to increase their website traffic, hire search engine optimization (SEO) companies to improve their site's ranking. There are numerous ways to improve a site's ranking, which may be broadly categorized as white-hat and gray-hat (or black-hat) SEO techniques. White-hat SEO methods focus on improving the quality and content of a page so that the information on the page is useful for many users. Such a method for improving a site's rank may be to improve the content of the site, such that it appears most relevant for those queries one would like to target.
[0003] However, there are many methods of improving ranking. Gray-hat and black-hat SEO techniques include methods such as link stuffing, keyword stuffing, cloaking, web farming, and so on. Link stuffing is the practice of creating many pages that have little content or duplicated content, all of which link to a single optimized target page. Ranking based on link structure can be fooled by link-stuffed pages into treating the target page as a better page since so many pages link to it. Keyword stuffing is when a page is filled with query terms, making it appear very relevant to a search whose query contains one or more of those terms, even though the actual relevance of the page may be low. A keyword-stuffed page will rank higher since it appears to have content relevant to the query. For over-loaded terms on the page, the page will appear higher in the search results and will ultimately draw users to click on the site.
[0004] White-hat techniques can result in a better web search experience for a user. In particular, the user is provided with satisfactory results based on the original intent of the search query. In contrast, gray-hat or black-hat techniques can derail the user's search process in the hope of persuading the user to buy something they were not originally looking for.
SUMMARY
[0005] A web spam page classifier is described that identifies web spam pages based on features of a search query and web page pair. The web page is identified by a URL. The features can be identified based on spamming techniques that detract from a search experience for a user. The features can be extracted from training instances and a training algorithm can be employed to develop the classifier. During operation, pages identified as web spam pages can be demoted and/or removed from a relevancy ranked list.
[0006] This Summary is provided to introduce some concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
BRIEF DESCRIPTION OF THE DRAWINGS
[0007] FIG. 1 is a block diagram of a search engine system.
[0008] FIG. 2 is a flow diagram of a method for rendering search results to a user based on a query.
[0009] FIG. 3 is a block diagram of a classifier training system.
[0010] FIG. 4 is a flow diagram of a method for training a classifier.
[0011] FIG. 5 is a block diagram of a general computing environment.
DETAILED DESCRIPTION
[0012] FIG. 1 is a block diagram of a search engine system 100 for receiving a search query and rendering a list of results based on the search query. During operation, system 100 receives a search query 102 from a user. The search query includes one or more terms for which the user desires to acquire further information. In some instances, the user operates a computing device such as a mobile device or personal computer to enter the search query 102. In turn, the search query 102 is sent to a search engine module 104 through a suitable means of communication. Search engine module 104 processes search query 102 to extract and/or identify features or attributes of the query and provide these features to a ranking module 106 and a web spam classification module 108. Although herein illustrated and discussed as being separate modules, ranking module 106 and web spam classification module 108 can be combined into a single module as desired.
[0013] Ranking module 106 accesses an index 110 that contains information related to a plurality of web pages. Web pages are identified uniquely by a URL. The URL includes a particular domain as well as a path to the particular page. There are several known methods for creating an index of web pages that can be accessed based on a search query. Among other things, index 110 can catalog words, titles, subtitles, metatags and/or any other information related to a particular web page and its corresponding domain. Based on a comparison of search query 102 to index 110, ranking module 106 outputs a ranked list 112 of web pages ranked by relevance to the query. Ranked list 112 can be rendered in a web page where the top results (e.g., the top 10 or 20 results) are rendered with hyperlinks that link to web pages.
[0014] Ranking module 106 can be driven by a ranking algorithm that identifies features of the search query 102, web pages under consideration and relationships of terms in the search query with content of the web pages. Example features include a most frequent term in the web page, a number of times a particular term appears in the web page, a domain name associated with the web page (e.g., www.example.com), a number of links pointing to the page, whether a query term appears in a title of the web page, etc. Based on these features, the ranking algorithm can perform a calculation as to the relevancy value of a page given a query. This calculation can be performed as a function of feature vectors.
[0015] In some instances, ranked list 112 may include one or more web spam pages. These web spam pages are designed to increase their ranking by ranking module 106 without increasing the quality or content of the page. These web spam pages are of limited value to a user and simply do not offer content of as high a quality as an actually relevant page. Elimination of web spam pages from search results can be important for several reasons. A user may choose to use a different search engine if web spam pages receive high rankings, and other legitimate sites may utilize spamming techniques to improve ratings. Ultimately, web spam pages negatively affect user experience, rankings of pages and processing costs.
[0016] To combat web spam pages, system 100 utilizes web spam classification module 108 to identify a list of web spam pages 114 as a function of the search query 102 and a comparison to index 110. Alternatively, web spam classification module 108 can identify spam pages directly from ranked list 112. Lists 112 and 114 are combined to provide an updated list 116 that can be provided to the user. Updated list 116 can demote web spam pages to a lower position in list 112 and/or eliminate web spam pages from the list. Web spam classification module 108 can be driven by a classifier designed to label a given web page as spam or non-spam.
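As an illustration of how ranked list 112 and spam list 114 might be combined into updated list 116, here is a minimal sketch. The function name `apply_spam_demotion` and the `demote_by` parameter are hypothetical and do not come from the application; the sketch simply removes flagged URLs or pushes them down the list.

```python
def apply_spam_demotion(ranked_urls, spam_urls, demote_by=None):
    """Return an updated ranking with spam pages removed or demoted.

    ranked_urls: URLs ordered by relevance (standing in for list 112).
    spam_urls:   set of URLs flagged as spam (standing in for list 114).
    demote_by:   None removes spam pages; an integer moves each spam
                 page roughly that many positions down (assumed policy).
    """
    if demote_by is None:
        return [u for u in ranked_urls if u not in spam_urls]

    # Sort key: original position, plus a penalty for spam pages.
    # The extra 0.5 places a demoted page just below the legitimate
    # page that already occupies the target position.
    def key(item):
        pos, url = item
        return pos + demote_by + 0.5 if url in spam_urls else pos

    return [url for _, url in sorted(enumerate(ranked_urls), key=key)]


ranked = ["a", "b", "c", "d", "e"]
removed = apply_spam_demotion(ranked, {"b"})
demoted = apply_spam_demotion(ranked, {"b"}, demote_by=2)
```

With a single flagged page the shift is exact; when several pages are flagged, legitimate pages move up to fill the gaps, so each spam page ends up only approximately `demote_by` positions lower.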
[0017] In one example, the classifier can perform a calculation based on the same or similar features utilized by the ranking algorithm associated with ranking module 106. This use of features can be advantageous since features for the query and pages are already extracted for operation of ranking module 106. Furthermore, by training a classifier on these features to locate spam web pages, outliers of a distribution for highly ranked pages can be an indication of web spam. For example, a web page that stuffs numerous keywords into its page for the purpose of increasing its relevancy ranking will be likely to have more keywords than a legitimate site for a particular keyword. By training the classifier to recognize these situations, web spam pages can be more accurately identified.
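The keyword-stuffing example above can be illustrated with a small sketch. It is not the trained classifier the description refers to; it swaps in a plain z-score test to show what "outlier of the distribution for highly ranked pages" could mean for a single feature. The function name, the threshold and the sample counts are all hypothetical.

```python
from statistics import mean, pstdev

def keyword_outliers(keyword_counts, threshold=2.0):
    """Return URLs whose keyword count is an upper outlier among ranked pages."""
    values = list(keyword_counts.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all pages look alike, nothing stands out
    # A page is flagged when its count lies more than `threshold`
    # standard deviations above the mean of the highly ranked pages.
    return [url for url, count in keyword_counts.items()
            if (count - mu) / sigma > threshold]

# Hypothetical keyword counts for six highly ranked pages.
counts = {
    "a.example/page": 4,
    "b.example/page": 6,
    "c.example/page": 5,
    "d.example/page": 5,
    "e.example/page": 7,
    "stuffed.example/page": 60,
}
flagged = keyword_outliers(counts)
```

A trained classifier would weigh many such features jointly rather than thresholding one of them, which is why the description favors learning from labeled examples.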
[0018] FIG. 2 is a method 200 performed by system 100 to provide list 116 to the user. At step 202, the search query 102 is accessed by search engine module 104. In Internet based search engines, a user is located remotely from the search engine module and enters the search query into a suitable web browser. At step 204, query terms from search query 102 are compared to index 110 and the ranked list of relevant pages is provided at step 206. Web spam pages are then identified at step 208 using web spam classification module 108 and an updated ranked list is provided at step 210. If one or more web spam pages are identified they can be removed from the ranked list or demoted to a lower ranking in the list, for example, by 10, 20, 25 or more positions. The updated list is output to the user at step 212. [0019] FIG. 3 is a block diagram of a system 300 for training a classifier of web spam classification module 108. System 300 includes a training module 302 that receives input in the form of training instances 304 and a feature model 306. Based on the training instances 304 and feature model 306, a classifier 308 is output. Feature model 306 includes spam based features 310, rank-time query independent features 312 and rank-time query dependent features 314.
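Steps 208 and 210 of method 200 can be sketched as a small list operation. This is a minimal illustration, assuming the ranked list is a list of URLs and the identified spam pages are a set; the function name `update_ranked_list` and the parameter `demote_by` are hypothetical and do not come from the description.

```python
def update_ranked_list(ranked, spam, demote_by=None):
    """Remove spam URLs, or move each one down by `demote_by` positions."""
    if demote_by is None:
        return [url for url in ranked if url not in spam]
    updated = list(ranked)
    # Walk from the bottom up so that moving one page does not shift
    # the positions of pages that have not been processed yet.
    for index in range(len(ranked) - 1, -1, -1):
        url = ranked[index]
        if url in spam:
            updated.pop(index)
            updated.insert(min(index + demote_by, len(updated)), url)
    return updated

ranked = ["a.example", "spam.example", "b.example", "c.example", "d.example"]
spam = {"spam.example"}

removed = update_ranked_list(ranked, spam)
demoted = update_ranked_list(ranked, spam, demote_by=2)
```

A demotion larger than the remaining list length simply places the page at the end of the list.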
[0020] To detect web spam pages, classifier 308 labels a given web page as spam or not spam. To develop classifier 308, the training module utilizes training and testing data that is composed of a number of labeled instances 304, or samples, where each sample has a vector of attributes, or features. In one example, the labels are determined by human judges. Classification involves producing a feature model 306 during training to predict the label of each instance in the test set given only the vector of feature values. To construct classifier 308, training instances 304 are used to determine the parameters of classifier 308. During testing, the classifier 308 examines a vector of features jointly to determine, based on the feature values, if a web page is spam or not. The classifier 308 is evaluated during testing by comparing the label given by the classifier with the instance's assigned label.
[0021] The classifier 308 can be trained using a suitable learning method. One example is a support vector machine (SVM). However, there are many different classification algorithms that can be used to detect web spam. At a high level, an SVM produces a linear separating hyperplane between the feature vectors for two class labels (i.e., spam and non-spam) in a transformed version of the feature space. Instances are then classified based on where they lie in the transformed version of the feature space. The SVM finds the separating hyperplane with maximal margin in the high-dimensional space.
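The idea of a maximal-margin separating hyperplane can be sketched in a few lines. The following is a simplified linear SVM trained by sub-gradient descent on the regularized hinge loss, with no feature-space transformation (no kernel), so it is only an approximation of the approach named above. The two features, the toy data and all hyperparameters are assumptions chosen for illustration.

```python
def train_linear_svm(samples, labels, epochs=2000, learning_rate=0.01, reg=0.01):
    """Fit weights and bias by sub-gradient descent on the regularized hinge loss."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(w * v for w, v in zip(weights, x)) + bias)
            # L2 regularization shrinks the weights, which favors a wide margin.
            weights = [w * (1.0 - learning_rate * reg) for w in weights]
            if margin < 1.0:
                # The sample is misclassified or inside the margin: move the hyperplane.
                weights = [w + learning_rate * y * v for w, v in zip(weights, x)]
                bias += learning_rate * y
    return weights, bias

def predict(weights, bias, x):
    """Label a feature vector: +1 for spam, -1 for non-spam."""
    score = sum(w * v for w, v in zip(weights, x)) + bias
    return 1 if score >= 0 else -1

# Hypothetical feature vectors: [keyword density, share of query terms in title].
samples = [[0.9, 0.8], [0.8, 0.9], [0.7, 1.0],   # labeled spam
           [0.1, 0.2], [0.2, 0.1], [0.3, 0.3]]   # labeled non-spam
labels = [1, 1, 1, -1, -1, -1]

weights, bias = train_linear_svm(samples, labels)
predictions = [predict(weights, bias, x) for x in samples]
```

Comparing `predictions` with `labels` mirrors the evaluation step described above, where the classifier's label is checked against the instance's assigned label.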
[0022] Classifier 308 is based on page-level, content-based classification, as opposed to host-level classification or link-level classification. It is worth noting that classifier 308 could be used in conjunction with a domain-level or link-level classifier, by using classifier 308 at rank time and another classifier at index-generation time, for example. Features based on domain and link information, however, may be used in the classifier. Training instances 304 include a large set of human-labeled (query, URL) pairs, although other approaches for obtaining training instances 304 can be used. In one instance, queries can be determined by looking at search engine query logs for a search engine such as Microsoft Live Search (available at www.live.com) provided by Microsoft Corporation of Redmond, Washington. The queries can be sampled such that the set of queries represents a realistic distribution of queries that users would submit to a search engine. The query frequencies are determined from toolbar data as well as query logs. Queries include commercial queries, spam queries, and non-commercial queries.
[0023] When using human labels, a human judge is given the list of queries and issues each query to a search engine. A returned list of 10 results with snippets is shown to the judge. For each URL appearing in the top 10 returned search results, the judge labels the URL as spam, not spam, or unknown. The judgment is made based on the quality of content, the use of obvious spam techniques, and whether or not the result should appear in the top 10.
[0024] Spam features 310 include page-level attributes. The attributes include domain-level features, page-level features, and link information. Values of these features can be determined by mining feature information for each URL in the testing and training sets.
Examples of such features include the number of spammy in-links, the top level domain of the site, quality of phrases in the document and density of keywords (spammy terms). The number of spammy in-links is the number of in-links coming from labeled spam pages. The quality of phrases in the document is a score that indicates the quality of terms on the page.
The density of keywords is a score that indicates how many terms on the page are spam terms.
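Two of the spam features named above lend themselves to a short sketch. The description does not give formulas for them, so the definitions below are assumptions: the in-link count is a membership count against a set of labeled spam pages, and the keyword density is the fraction of page terms found in a spam-term list. All names and sample data are hypothetical.

```python
def count_spammy_in_links(in_links, labeled_spam):
    """Count in-links whose source page has been labeled as spam."""
    return sum(1 for source in in_links if source in labeled_spam)

def keyword_density(page_terms, spam_terms):
    """Fraction of the page's terms that appear in a list of spam terms."""
    if not page_terms:
        return 0.0
    hits = sum(1 for term in page_terms if term.lower() in spam_terms)
    return hits / len(page_terms)

# Hypothetical inputs for a single URL.
in_links = ["spam1.example", "news.example", "spam2.example"]
labeled_spam = {"spam1.example", "spam2.example", "spam3.example"}
page_terms = "cheap pills buy cheap pills online today".split()
spam_terms = {"cheap", "pills"}

spammy_in_links = count_spammy_in_links(in_links, labeled_spam)
density = keyword_density(page_terms, spam_terms)
```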
[0025] In addition to spam features 310, feature model 306 includes rank-time features.
Rank-time features are features extracted for use in the ranking algorithm. A large number of web spam pages appear in ranked search results. In order to receive a high rank, web spam pages must contain content that "fools" algorithms used for populating the index and for ranking search results. These algorithms take feature vectors as input, where the feature vectors have a specific distribution over the feature set. The distribution is difficult for a spammer to match without knowledge of how the crawling, indexing, and ranking algorithms work. Even though ranking module 106 believes the web spam page to be highly relevant, classifier 308, with the same feature data as input, but trained on spam labels, should be able to easily identify web spam pages, since they will be outliers of the distribution. Ranking module 106 is trained to solve a different problem, namely the ordering of relevant pages, not the identification of web spam pages. By using a separate classifier trained to catch web spam, web spam can be demoted and/or removed in the ranked results.
[0026] Rank-time features can be separated into query-independent features 312 and query-dependent features 314. The query-independent rank-time features 312 can be grouped into page-level features, domain-level features, anchor features, popularity features, and time features.
[0027] Page-level features are features that can be determined by looking just at a page or URL. Examples of page-level features include static rank features, the count of the most frequent term, the count of the number of unique terms, the total number of terms, the number of words in the path, and the number of words in the title.
[0028] Domain-level features are computed as averages across all pages in a domain, although other modes of calculation can be used. Examples of domain-level features include the rank of the domain, the average number of words, and the top-level domain.
[0029] Popularity features are features that measure the popularity of pages through user data. Popularity features can be derived from toolbar data, where the user has agreed to provide access to data collected during a logged session. The popularity features can include domain- and page-level features. Example features are the number of hits within a domain, the number of users of a domain, the number of hits on a URL, and the number of users of a URL. Time features include the date the URL was crawled, the date the page last changed, and the time since the page was crawled. Other features, such as frequent term counts, anchor text features, etc. can also be used. Exemplary rank-time query-independent features are listed in Table 1.
Page level: static rank, most frequent term, number of unique terms, total number of terms, number of words in path, number of words in title
Domain level: domain rank, average number of words, top-level domain
Popularity: domain hits, domain users
Time: date crawled, last change date, time since crawled
Table 1. Rank-time query independent features.
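Several of the page-level entries in Table 1 can be computed from the URL, title and body text alone. The sketch below is one possible reading of those entries, assuming a simple lowercase alphanumeric tokenizer; static rank, domain rank, popularity and time features need data from the crawler and toolbar logs and are left out. The function names and the sample page are hypothetical.

```python
import re
from collections import Counter
from urllib.parse import urlparse

def tokenize(text):
    """Split text into lowercase alphanumeric terms (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def page_level_features(url, title, body):
    """Compute query-independent features from a single page and its URL."""
    terms = tokenize(body)
    counts = Counter(terms)
    parsed = urlparse(url)
    return {
        "most_frequent_term_count": max(counts.values(), default=0),
        "unique_terms": len(counts),
        "total_terms": len(terms),
        "words_in_path": len(tokenize(parsed.path)),
        "words_in_title": len(tokenize(title)),
        "top_level_domain": (parsed.hostname or "").rsplit(".", 1)[-1],
    }

features = page_level_features(
    "http://www.example.com/cheap-flights/deals.html",
    "Cheap Flights and Deals",
    "cheap flights cheap hotels cheap deals",
)
```

Because none of these values depend on the query, they could be computed once when the page is indexed and reused at rank time.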
[0030] Query-dependent features 314 are content features that relate to the one or more terms in search query 102. This feature set can include several hundred query-dependent features.
Query-dependent features are generated from the query, document content, and URL. Query-dependent features can depend just on the query or on the match between query and document properties. Examples of query-dependent features include the number of query terms in the title and the frequency of a query term on the page, as well as various counts of the occurrences of the query term across all documents, the number of documents that contain the query term, and n-grams over the query terms and the document. Table 2 lists several example query dependent features that can be used in classifier 308.
Number of query terms in title
Frequency counts of query terms in the page
Frequency counts of query terms over all documents in a domain
Number of documents containing query terms
N-grams over query terms/page
Table 2. Rank-time query-dependent features.
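The entries in Table 2 can be sketched in the same style. The description does not define exactly how each count is taken, so the definitions below are assumptions: title matches count distinct query terms, frequencies are raw counts in the page body, and n-gram overlap counts word bigrams shared by query and page. The function names and sample texts are hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase alphanumeric terms (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def ngrams(tokens, n):
    """Return the set of word n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def query_dependent_features(query, title, body, n=2):
    """Compute features that relate the query terms to one page."""
    query_terms = tokenize(query)
    title_terms = set(tokenize(title))
    body_terms = tokenize(body)
    body_counts = Counter(body_terms)
    return {
        "query_terms_in_title": sum(1 for t in set(query_terms) if t in title_terms),
        "query_term_frequencies": {t: body_counts[t] for t in set(query_terms)},
        "shared_ngrams": len(ngrams(query_terms, n) & ngrams(body_terms, n)),
    }

def documents_containing(term, documents):
    """Count the documents in which a query term occurs at least once."""
    return sum(1 for doc in documents if term in set(tokenize(doc)))

body = "cheap flights to paris and cheap hotels, book cheap flights now"
features = query_dependent_features("cheap flights", "Cheap Flights and Deals", body)
doc_count = documents_containing("cheap", [body, "train tickets to paris", "cheap hotels"])
```

Unlike the query-independent features, these values change with every query, so they are only available at rank time.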
[0031] Although query-dependent features are utilized in classifier 308, a spam label is not necessarily query-dependent. Currently, however, classifier 308 is utilized such that a query is issued and then each returned page is examined to determine if it is spam or not. Instances where different queries yield different results as to whether a page is spam or not can be corrected if desired.
[0032] FIG. 4 is a method 400 for outputting classifier 308 that can be used by web spam classification module 108. At step 402, labeled training instances including a query and a URL are obtained. These instances include both non-spam web pages and spam web pages. A feature based model is obtained at step 404 to identify features of web pages as well as features of relationships between queries and web pages. At step 406, the classifier is trained to identify web spam pages given a query. Finally, at step 408, the classifier is output.
[0033] Using an extended set of features based on query terms (i.e., rank-time content) can be effective in classifying web spam pages. Since one goal of spammers is to generate data to fool search engine ranking algorithms, tell-tale properties of data in web spam pages can be detected by secondary classifiers.
[0034] The above description of illustrative embodiments is described in accordance with a search engine receiving a query and determining if a given page is web spam as a function of the query. Below is a suitable computing environment that can incorporate and benefit from these embodiments, for example within systems 100 and 300 or to perform methods 200 and 400. The computing environment shown in FIG. 5 is one such example that can be used to implement a portion or all of these systems and/or methods. [0035] In FIG. 5, the computing system environment 500 is only one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the claimed subject matter. Neither should the computing environment 500 be interpreted as having any dependency or requirement relating to any one or combination of components illustrated in the exemplary computing environment 500. [0036] Computing environment 500 illustrates a general purpose computing system environment or configuration. Examples of well-known computing systems, environments, and/or configurations that may be suitable for use with the service agent or a client device include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, telephony systems, distributed computing environments that include any of the above systems or devices, and the like.
[0037] Concepts presented herein may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Some embodiments are designed to be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules are located in both local and remote computer storage media including memory storage devices.
[0038] Exemplary environment 500 for implementing the above embodiments includes a general-purpose computing system or device in the form of a computer 510. Components of computer 510 may include, but are not limited to, a processing unit 520, a system memory 530, and a system bus 521 that couples various system components including the system memory to the processing unit 520. The system bus 521 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus also known as Mezzanine bus.
[0039] Computer 510 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer 510 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media.
Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data.
[0040] The system memory 530 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 531 and random access memory (RAM) 532. The computer 510 may also include other removable/non-removable volatile/nonvolatile computer storage media. Non-removable nonvolatile storage media are typically connected to the system bus 521 through a non-removable memory interface such as interface 540. Removable nonvolatile storage media are typically connected to the system bus 521 by a removable memory interface, such as interface 550.
[0041] A user may enter commands and information into the computer 510 through input devices such as a keyboard 562, a microphone 563, a pointing device 561, such as a mouse, trackball or touch pad, and a video camera 564. These and other input devices are often connected to the processing unit 520 through a user input interface 560 that is coupled to the system bus, but may be connected by other interface and bus structures, such as a parallel port or a universal serial bus (USB). A monitor 591 or other type of display device is also connected to the system bus 521 via an interface, such as a video interface 590. In addition to the monitor, computer 510 may also include other peripheral output devices such as speakers 597, which may be connected through an output peripheral interface 595. [0042] The computer 510, when implemented as a client device or as a service agent, is operated in a networked environment using logical connections to one or more remote computers, such as a remote computer 580. The remote computer 580 may be a personal computer, a hand-held device, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 510. The logical connections depicted in FIG. 5 include a local area network (LAN) 571 and a wide area network (WAN) 573, but may also include other networks. Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets and the Internet. [0043] When used in a LAN networking environment, the computer 510 is connected to the LAN 571 through a network interface or adapter 570. When used in a WAN networking environment, the computer 510 typically includes a modem 572 or other means for establishing communications over the WAN 573, such as the Internet. 
The modem 572, which may be internal or external, may be connected to the system bus 521 via the user input interface 560, or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 510, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation, FIG. 5 illustrates remote application programs 585 as residing on remote computer 580. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between computers may be used.
[0044] Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims

WHAT IS CLAIMED IS:
1. A method of processing web pages, comprising: receiving a search query including at least one term; comparing the search query to information corresponding to a plurality of web pages; identifying web spam pages from the plurality of web pages as a function of the at least one term in the search query and the information corresponding to the plurality of web pages; and providing a ranked list of web pages from the plurality of web pages as a function of the search query and the identified web spam pages.
2. The method of claim 1 and further comprising: extracting features of the search query as a function of the at least one term and the information corresponding to the plurality of web pages; performing a calculation for each web page based on the extracted features to classify the web page as a web spam page or a non-web spam page.
3. The method of claim 2 wherein the features include at least one of a number of terms in the search query in a title of each web page, a frequency count of terms in the search query that are in each web page, a frequency count of terms in the search query that are in all web pages of a domain, a number of web pages that contain terms in the search query and multiple terms that are in both the search query and each web page.
4. The method of claim 2 wherein the features include features indicative of web spam pages, features independent of the search query and features dependent of the search query.
5. The method of claim 2 wherein the features include page-level, domain-level, popularity and time based features.
6. The method of claim 1 and further comprising: identifying a relevancy list of web pages relevant to the search query based on the comparison of the search query to information corresponding to the plurality of web pages; and updating the relevancy list based on the identified web spam pages to provide the ranked list.
7. The method of claim 6 wherein updating includes one of demoting web spam pages in the relevancy list and removing web spam pages from the relevancy list.
8. A system of processing web pages, comprising: a search engine module adapted to receive a search query including at least one term; a ranking module adapted to compare the search query to information corresponding to a plurality of web pages; a web spam classification module adapted to identify web spam pages from the plurality of web pages as a function of the at least one term in the search query and the information corresponding to the plurality of web pages; and an output module adapted to provide a ranked list of web pages from the plurality of web pages as a function of the search query and the identified web spam pages.
9. The system of claim 8 and further comprising wherein the ranking module is adapted to extract features of the search query as a function of the at least one term in the search query and the information corresponding to the plurality of web pages and wherein the web spam classification module is adapted to perform a calculation for each web page based on the extracted features to classify the web page as a web spam page or a non-web spam page.
10. The system of claim 9 wherein the features include at least one of a number of terms in the search query in a title of each web page, a frequency count of terms in the search query that are in each web page, a frequency count of terms in the search query that are in all web pages of a domain, a number of web pages that contain terms in the search query and multiple terms that are in both the search query and each web page.
11. The system of claim 9 wherein the features include features indicative of web spam pages, features independent of the search query and features dependent of the search query.
12. The system of claim 9 wherein the features include page-level, domain-level, popularity and time based features.
13. The system of claim 8 wherein the ranking module is adapted to identify a relevancy list of web pages relevant to the search query based on the comparison of the search query to information corresponding to the plurality of web pages and wherein the web spam classification module is adapted to update the relevancy list based on the identified web spam pages to provide the ranked list.
14. The system of claim 13 wherein the web spam classification module performs one of demoting web spam pages in the relevancy list and removing web spam pages from the relevancy list.
15. The system of claim 8 wherein the information corresponding to the plurality of web pages is stored in an index accessible by the ranking module.
16. A training system for constructing a classifier, comprising: a set of training instances, each instance including a query identifier and a page identifier; a feature model identifying features related to web pages and related to relationships between the query identifier and information in the web pages; and a training module adapted to access the set of training instances and the feature model to output a classifier for labeling a given web page as spam or non-spam.
17. The training system of claim 16 wherein the training module utilizes a support vector machine to identify a hyperplane as a function of the features to construct the classifier.
18. The training system of claim 16 wherein the features include features indicative of web spam pages, features independent of the search query and features dependent of the search query.
19. The training system of claim 16 wherein the features include at least one of a number of terms in the search query in a title of each web page, a frequency count of terms in the search query that are in each web page, a frequency count of terms in the search query that are in all web pages of a domain, a number of web pages that contain terms in the search query and multiple terms that are in both the search query and each web page.
20. The training system of claim 16 wherein the features include page-level, domain-level, popularity and time based features.
PCT/US2008/058637 2007-04-30 2008-03-28 Web spam page classification using query-dependent data WO2008134172A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US11/742,156 2007-04-30
US11/742,156 US7853589B2 (en) 2007-04-30 2007-04-30 Web spam page classification using query-dependent data

Publications (1)

Publication Number Publication Date
WO2008134172A1 true WO2008134172A1 (en) 2008-11-06

Family

ID=39888207

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2008/058637 WO2008134172A1 (en) 2007-04-30 2008-03-28 Web spam page classification using query-dependent data

Country Status (4)

Country Link
US (1) US7853589B2 (en)
CL (1) CL2008001189A1 (en)
TW (1) TWI437452B (en)
WO (1) WO2008134172A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20040072059A (en) * 2003-02-08 2004-08-18 디프소프트 주식회사 Method for automatically blocking spam mail by connection of link url
KR20040103763A (en) * 2004-01-15 2004-12-09 엔에이치엔(주) A method of managing web sites registered in search engine
US20060095416A1 (en) * 2004-10-28 2006-05-04 Yahoo! Inc. Link-based spam detection
US20070038600A1 (en) * 2005-08-10 2007-02-15 Guha Ramanathan V Detecting spam related and biased contexts for programmable search engines

Family Cites Families (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5826260A (en) * 1995-12-11 1998-10-20 International Business Machines Corporation Information retrieval system and method for displaying and ordering information based on query element contribution
US7117358B2 (en) * 1997-07-24 2006-10-03 Tumbleweed Communications Corp. Method and system for filtering communication
US6785671B1 (en) * 1999-12-08 2004-08-31 Amazon.Com, Inc. System and method for locating web-based product offerings
US6480837B1 (en) * 1999-12-16 2002-11-12 International Business Machines Corporation Method, system, and program for ordering search results using a popularity weighting
US7188106B2 (en) * 2001-05-01 2007-03-06 International Business Machines Corporation System and method for aggregating ranking results from various sources to improve the results of web searching
US6795820B2 (en) * 2001-06-20 2004-09-21 Nextpage, Inc. Metasearch technique that ranks documents obtained from multiple collections
US6978274B1 (en) * 2001-08-31 2005-12-20 Attenex Corporation System and method for dynamically evaluating latent concepts in unstructured documents
US7219148B2 (en) * 2003-03-03 2007-05-15 Microsoft Corporation Feedback loop for spam prevention
US7197497B2 (en) * 2003-04-25 2007-03-27 Overture Services, Inc. Method and apparatus for machine learning a document relevance function
US20050015626A1 (en) * 2003-07-15 2005-01-20 Chasin C. Scott System and method for identifying and filtering junk e-mail messages or spam based on URL content
US20050120019A1 (en) * 2003-11-29 2005-06-02 International Business Machines Corporation Method and apparatus for the automatic identification of unsolicited e-mail messages (SPAM)
US20050216564A1 (en) * 2004-03-11 2005-09-29 Myers Gregory K Method and apparatus for analysis of electronic communications containing imagery
US7349901B2 (en) * 2004-05-21 2008-03-25 Microsoft Corporation Search engine spam detection using external data
US7664819B2 (en) * 2004-06-29 2010-02-16 Microsoft Corporation Incremental anti-spam lookup and update service
US7580921B2 (en) * 2004-07-26 2009-08-25 Google Inc. Phrase identification in an information retrieval system
US7716198B2 (en) * 2004-12-21 2010-05-11 Microsoft Corporation Ranking search results using feature extraction
US7962510B2 (en) * 2005-02-11 2011-06-14 Microsoft Corporation Using content analysis to detect spam web pages
US7562304B2 (en) * 2005-05-03 2009-07-14 Mcafee, Inc. Indicating website reputations during website manipulation of user information
US7769751B1 (en) * 2006-01-17 2010-08-03 Google Inc. Method and apparatus for classifying documents based on user inputs

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20040072059A (en) * 2003-02-08 2004-08-18 디프소프트 주식회사 Method for automatically blocking spam mail by connection of link url
KR20040103763A (en) * 2004-01-15 2004-12-09 엔에이치엔(주) A method of managing web sites registered in search engine
US20060095416A1 (en) * 2004-10-28 2006-05-04 Yahoo! Inc. Link-based spam detection
US20070038600A1 (en) * 2005-08-10 2007-02-15 Guha Ramanathan V Detecting spam related and biased contexts for programmable search engines

Also Published As

Publication number Publication date
US20080270376A1 (en) 2008-10-30
CL2008001189A1 (en) 2008-12-26
TW200849045A (en) 2008-12-16
TWI437452B (en) 2014-05-11
US7853589B2 (en) 2010-12-14

Similar Documents

Publication Publication Date Title
US7853589B2 (en) Web spam page classification using query-dependent data
US8051080B2 (en) Contextual ranking of keywords using click data
US7620631B2 (en) Pyramid view
KR101230687B1 (en) Link-based spam detection
Ye et al. Sentiment classification for movie reviews in Chinese by improved semantic oriented approach
Wang et al. Evaluating contents-link coupled web page clustering for web search results
CN108737423B (en) Phishing website discovery method and system based on webpage key content similarity analysis
US8712999B2 (en) Systems and methods for online search recirculation and query categorization
US20120265757A1 (en) Ranking blog documents
US20060026152A1 (en) Query-based snippet clustering for search result grouping
WO2007127676A1 (en) System and method for indexing web content using click-through features
Al-asadi et al. A survey on web mining techniques and applications
Cheng et al. Fuzzy matching of web queries to structured data
Liu et al. Topical Web Crawling for Domain-Specific Resource Discovery Enhanced by Selectively using Link-Context.
KR102169143B1 (en) Apparatus for filtering url of harmful content web pages
Priyatam et al. Domain specific search in Indian languages
Preetha et al. Personalized search engines on mining user preferences using clickthrough data
Yang et al. Clustering of web search results based on combination of links and in-snippets
Deogun et al. Structural abstractions of hypertext documents for web-based retrieval
TWI483129B (en) Retrieval method and device
Sugiyama et al. A method of improving feature vector for web pages reflecting the contents of their out-linked pages
Berlocher et al. TopicRank: bringing insight to users
EP2662785A2 (en) A method and system for non-ephemeral search
Gong et al. An implementation of web image search engines
Choudhary et al. Various link algorithms in web mining

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application (Ref document number: 08744590; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 08744590; Country of ref document: EP; Kind code of ref document: A1)