CN102890702A - Internet forum-oriented opinion leader mining method - Google Patents


Publication number
CN102890702A
Authority
CN
China
Prior art keywords
opinion
leader
forum
comment
utilize
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2012102501161A
Other languages
Chinese (zh)
Inventor
葛斌
李芳芳
汤大权
蒋林承
唐九阳
王桢文
胡升泽
戴长华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN2012102501161A
Publication of CN102890702A
Legal status: Pending

Abstract

The invention discloses an opinion leader mining method for Internet forums. The method involves an opinion leader mining system comprising a computing center and a database server that communicates with the computing center. The method comprises the following steps: capturing forum data with a crawler, and using message-oriented middleware (MOM) to improve the real-time performance of data processing; extracting web page information, performing word segmentation with a Chinese word segmentation system, and filtering spam comments with a spectral clustering method; analyzing text orientation with a sentiment corpus; setting a selection threshold for opinion leaders and determining the opinion leaders; and visualizing the result. The method can accurately mine the opinion leaders in a forum, and provides technical support for Internet public opinion supervision departments to discover hot issues in time and guide the healthy development of Internet public opinion.

Description

An opinion leader mining method for Internet forums
Technical field
The present invention relates to the field of Internet information management, and in particular to an opinion leader mining method for Internet forums.
Background
With the rapid development of network technology and the rapid growth of the number of Internet users, more and more members of the public take part in social discussion and express their opinions online. Because the Internet offers equal exchange and broad participation, many domestic and international hot events can quickly generate enormous pressure from online public opinion, and the Internet has become one of the main carriers reflecting public sentiment.
Opinion leaders play a significant catalytic role in the formation of online public opinion. An opinion leader is a person who can put forward guiding opinions and has broad social influence. Opinion leaders have accumulated high popularity in a forum. As a public opinion event ferments, participants are more easily influenced by opinion leaders, whose statements and opinions tend to affect and change the opinions of others and to guide and drive the further development of events. Their role in the emergence, development and fading of online public opinion may be positive or negative, so mining opinion leaders has important practical significance. However, with today's explosive growth of Internet data, traditional methods that rely on complex statistical data to find opinion leaders are no longer adequate.
Summary of the invention
The technical problem to be solved by the present invention is to address the deficiencies of the prior art by providing an opinion leader mining method for Internet forums that accurately mines the opinion leaders in a forum and provides technical support for Internet public opinion supervision departments to discover hot issues in time and guide the healthy development of Internet public opinion.
To solve the above technical problem, the technical solution adopted by the present invention is an opinion leader mining method for Internet forums that involves an opinion leader mining system. The system comprises a computing center and a database server, and the database server communicates with the computing center. The specific steps of the method are:
1) capture forum data with a crawler, and use message-oriented middleware to improve the real-time performance of data processing;
2) extract web page information, perform word segmentation with a Chinese word segmentation system, and filter spam comments with a spectral clustering method;
3) analyze text orientation with a sentiment corpus;
4) set a selection threshold for opinion leaders, and determine the opinion leaders with the following formula:
PR(A) = (1-d) + d*(PR(T1)*L(A,T1) + … + PR(Tn)*L(A,Tn)),
where PR(A) is the score of a given page A; d is the damping factor; PR(Tn) is the score of a page Tn that points to page A; L(A,Tn) is the link relevance between page A and page Tn, L(A,Tn) = (Ua ∩ Un)/(Ua ∪ Un); Ua is the set of URLs consisting of the out-links and in-links of page A and page A itself; and Un is the set of URLs consisting of the out-links and in-links of page Tn and page Tn itself;
when PR(A) is greater than the selection threshold that has been set, the corresponding user is determined to be an opinion leader;
5) visualize the result of step 4).
Preferably, in step 1), the open-source crawler Netcrawler is used to collect the forum data, and the message-oriented middleware is ActiveMQ.
Preferably, in step 2), regular expressions are used to extract the web page information, and the Chinese word segmentation system is ICTCLAS, a Chinese lexical analysis system based on a multi-layer hidden Markov model.
In step 2), the steps for filtering spam comments with the spectral clustering method are:
1) collect the text corpus and preprocess the text to obtain a comment set;
2) for the comment set obtained, represent the features of each comment with the vector space model, so that each comment is expressed as a vector in the space;
3) generate the similarity matrix G;
4) construct the unnormalized Laplacian matrix as the sample matrix: derive the adjacency matrix W from the similarity matrix G, sum the elements of each column of W to obtain N numbers, and place them on the diagonal (all other entries zero) to form an N*N matrix, denoted D; let L = D - W, where L is the sample matrix;
5) construct the eigenvector space: compute the first k eigenvalues of L and the corresponding eigenvectors, and form an N*k matrix from these k eigenvectors, which is the eigenvector space; here the first k eigenvalues are taken in ascending order of magnitude;
6) arrange the k eigenvectors as columns to form an N*k matrix, treat each row as a vector in k-dimensional space, and cluster the rows with the spectral clustering algorithm;
7) on the basis of the spectral clustering, compute the Euclidean distance from each point to the center of its cluster;
8) compute the outlier degree of each object: take the mean E(Xi) and the variance E(Xi-E(Xi))^2 of the above distances as the basic data, and compute the outlier degree with the formula Out(i) = E(Xi)/E(Xi-E(Xi))^2;
9) detect spam comments with the outlier degree: spam comments are treated as outliers, so outlier detection only requires a Top-n ranking by outlier degree; the objects with the highest outlier degree are the outliers, i.e. the detected spam comments;
10) delete the detected spam comments from the database.
Preferably, in step 3), the sentiment corpus used is the sentiment analysis word set in HowNet201104.
Preferably, in step 4), the selection threshold is 0.001 and the damping factor d is set to 0.8.
Preferably, in step 5), Vizster is used for visualization.
Aimed at the massive volume of comments in Internet forums, the present invention designs and implements an automatic opinion leader mining system for forums. The system can accurately mine the opinion leaders in a forum and provides technical support for Internet public opinion supervision departments to discover hot issues in time and guide the healthy development of Internet public opinion.
Description of drawings
Fig. 1 is a schematic diagram of the hardware platform of the automatic opinion leader mining system according to one embodiment of the invention;
Fig. 2 shows the crawler configuration interface according to one embodiment of the invention;
Fig. 3 is a flow chart of the steps for filtering spam comments with the spectral clustering method according to one embodiment of the invention;
Fig. 4 lists the basic metacharacters of regular expressions;
Fig. 5 shows the sentiment analysis word sets of HowNet.
Embodiment
As shown in Fig. 1, the automatic opinion leader mining system of one embodiment of the invention consists mainly of a data center and a computing center. The data center is a database server that stores the data crawled from Internet forums and provides data services to the computing center. The computing center processes the data provided by the data center with a series of algorithms and thereby mines the opinion leaders.
The steps of the automatic opinion leader mining method of one embodiment of the invention are as follows:
1. Collecting forum data
(1) Capturing forum data with an open-source crawler
Forum data collection is the data foundation of opinion leader mining. The present invention uses the open-source crawler Netcrawler to collect forum data. Netcrawler is a front-end system for web crawling that roams a collection of Web documents along links. Its basic working principle is the same as that of a crawler that performs a breadth-first search from seed URLs: starting from the given seed URLs, it reads the corresponding documents over standard protocols such as HTTP, takes every URL in those documents that has not yet been visited as a new starting point, and continues roaming until no new URL satisfies the conditions. The main function of Netcrawler is to download all the web pages in a domain and then analyze and process all related elements, including pictures, text, audio and video. The latest version of Netcrawler can be downloaded from http://freecode.com/projects/netcrawler. Given the URL of a forum home page or of a forum channel as the seed URL, the crawler obtains the source code of the pages in that channel that contain topics and replies; the parsing of the reply metadata itself is carried out during web page preprocessing. The present invention keeps the system kernel, configures the elements to be collected from the forum (including title, body, poster, posting time, IP address, number of views and number of replies), and modifies the system interface to make it easier to use. The crawler fetches the web pages and related posts according to the forum seed URL supplied by the user and stores them locally, providing the raw data that the system uses to append and update its data.
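The breadth-first roaming described above can be illustrated with a short Python sketch. It is not Netcrawler's code: the `crawl` function, the `fetch` callback and the in-memory `SITE` dictionary are hypothetical names introduced here so that the example runs without network access, and the depth limit stands in for the crawl depth setting.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href values of <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, fetch, max_depth=2):
    """Breadth-first crawl restricted to the seed's domain.

    fetch is a caller-supplied function url -> HTML text or None
    (hypothetical; in practice it would wrap an HTTP client).
    Returns {url: html} in visit order.
    """
    domain = urlparse(seed_url).netloc
    visited = {seed_url}
    queue = deque([(seed_url, 0)])
    pages = {}
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        if depth >= max_depth:
            continue
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)
            # Only unvisited URLs in the same domain become new starting points.
            if urlparse(link).netloc == domain and link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return pages


# Toy in-memory "forum" standing in for real HTTP responses.
SITE = {
    "http://forum.example/": '<a href="/t/1">t1</a><a href="/t/2">t2</a>',
    "http://forum.example/t/1": '<a href="/t/2">t2</a><a href="http://other.example/">x</a>',
    "http://forum.example/t/2": '<a href="/">home</a>',
}
pages = crawl("http://forum.example/", SITE.get)
```

The crawl stops on its own once every reachable URL in the domain has been visited, which corresponds to the condition that no new URL satisfies the requirements.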
The crawler configuration interface, shown in Fig. 2, provides settings for the seed URL, number of threads, save location, crawl depth, crawl type and keyword expression.
(2) Using message-oriented middleware to improve the real-time performance of data processing
Message-oriented middleware (MOM) is a specific kind of middleware. It uses an efficient and reliable message passing mechanism to exchange data in a platform-independent way, and integrates distributed systems on the basis of data communication. Its most important function is to provide timely and reliable message communication. To deliver messages, it normally manages them in queues: when data is transmitted, it is split into message units of a user-defined size and placed in a message queue, and the middleware sends or receives the messages in synchronous or asynchronous mode. In practice, techniques such as message priorities, resumable transfer, reliable message queues and in-memory queues are often used to guarantee delivery, and some products also add features such as flow control and pre-established connections. The core of message-oriented middleware is message passing, a technology that supports high-speed, asynchronous and reliable program-to-program communication.
This embodiment uses the message-oriented middleware ActiveMQ. ActiveMQ is open-source software released under the Apache 2.0 license that implements the Java Message Service (JMS) open standard; it is a standards-based messaging solution, and the latest version can be downloaded from http://activemq.apache.org/download.html. While crawling, the system sends the URL, crawl time and local storage path of each page as a message to an ActiveMQ message queue. Through the synchronous message passing mechanism of ActiveMQ, the page information in the queue is delivered to the web page preprocessing stage in real time, so that newly fetched pages are noticed promptly and can be processed. ActiveMQ also provides the interoperability, security, scalability, availability, manageability and other capabilities required for system deployment.
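The hand-off between the crawler and the preprocessing stage can be sketched as a producer and a consumer sharing a queue. The sketch below does not use ActiveMQ or JMS: Python's in-process `queue.Queue` is swapped in as a stand-in, and the function names, message fields, URLs, timestamps and paths are all made up for illustration.

```python
import queue
import threading

page_queue = queue.Queue()   # in-process stand-in for an ActiveMQ queue
processed = []               # storage paths handled by the preprocessing stage
STOP = None                  # sentinel telling the consumer to exit


def publish(url, crawl_time, store_path):
    """Crawler side: announce a newly stored page."""
    page_queue.put({"url": url, "crawl_time": crawl_time, "path": store_path})


def preprocess_worker():
    """Preprocessing side: handle each page as soon as its message arrives."""
    while True:
        msg = page_queue.get()
        if msg is STOP:
            break
        # Real code would open msg["path"] and extract the posts here.
        processed.append(msg["path"])


worker = threading.Thread(target=preprocess_worker)
worker.start()
publish("http://forum.example/t/1", "2012-07-19T10:00:00", "/data/pages/1.html")
publish("http://forum.example/t/2", "2012-07-19T10:00:05", "/data/pages/2.html")
page_queue.put(STOP)
worker.join()
```

The point of the design is decoupling: the crawler never waits for preprocessing, and the preprocessing stage learns about a new page as soon as its message is queued.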
2. Extracting web page information, segmenting words with a Chinese word segmentation system, and filtering spam comments with a spectral clustering method
(1) Extracting web page information
After the raw web pages have been captured, web page information extraction is needed to obtain the comments they contain. HTML is a semi-structured descriptive language that focuses on the visual presentation of content (font, size, color, position and so on) and neglects the organizational structure of the content. Once records in a database have been formatted into an HTML page, their structural information is lost. For the convenience of readers, a content page usually contains several posts, each rendered visually as a relatively independent block of information. In the HTML document structure of the page, each such block corresponds to a relatively independent Document Object Model (DOM) subtree; all block subtrees sit under the same parent node and share the same internal structure. Comment pages generally hold their data in tables. A table is a composite HTML structure whose main tags are <TABLE>, <TH>, <TR> and <TD>, which together specify the layout of the table.
Messages are taken from the message queue. The extracted web page metadata is stored in a database table, while the extracted comment content is stored on the local disk as documents that map one-to-one to the records in the database table. The metadata of a page comprises its URL, source site, local storage path, comment title, author ID, number of views, number of replies and publication time. The local storage path serves as the primary identifier of the page and associates the local comment document with the database record.
Because HTML is a markup language, regular expressions can be used to extract the useful information. Regular expressions are mainly used for text-based searching and editing; they support data validation and text replacement, and can extract substrings from a string by pattern matching. Fig. 4 describes several commonly used regular expression metacharacters. The present invention uses C# regular expressions on the Visual Studio 2008 platform to extract useful information from the page source.
The text information in a page, including the page title, author, post content, repliers and reply content, is extracted by regular expression matching. The extracted text is stored in an Oracle database; the version used by the present invention is Oracle 10g.
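The table-based extraction can be illustrated as follows. The embodiment uses C# regular expressions; this sketch uses Python's `re` module instead, and the markup, class names and pattern are hypothetical, since real forum pages differ and the patent does not give its patterns.

```python
import re

# Hypothetical markup for one thread page; real forums differ.
HTML = """
<table>
<tr><td class="author">alice</td><td class="body">Original post text</td></tr>
<tr><td class="author">bob</td><td class="body">I agree with the analysis</td></tr>
<tr><td class="author">carol</td><td class="body">Buy cheap watches</td></tr>
</table>
"""

# One table row per post: capture the author cell and the body cell.
ROW = re.compile(
    r'<tr>\s*<td class="author">(?P<author>.*?)</td>\s*'
    r'<td class="body">(?P<body>.*?)</td>\s*</tr>',
    re.DOTALL,
)

posts = [m.groupdict() for m in ROW.finditer(HTML)]
poster, replies = posts[0], posts[1:]   # first row is the opening post
```

Each dictionary in `posts` would then map onto one database record, with the first row treated as the opening post and the remaining rows as replies.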
(2) Chinese word segmentation
Word segmentation is the key and the foundation of text mining. This embodiment uses the free version of the Hailiang word segmentation development interface, a Chinese word segmentation system in wide use today. ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System), the Chinese lexical analysis system based on a multi-layer hidden Markov model developed by the Institute of Computing Technology of the Chinese Academy of Sciences, offers both high segmentation accuracy and good segmentation efficiency. Its main functions include Chinese word segmentation, part-of-speech tagging, named entity recognition and new word recognition; it also supports user dictionaries, Traditional Chinese, and multiple encodings such as GBK, UTF-8, UTF-7 and UNICODE. The segmentation system currently reaches a single-machine segmentation speed of 996 KB/s and a segmentation precision of 98.45%; its API is no larger than 200 KB, and its dictionary data is smaller than 3 MB after compression. ICTCLAS is written entirely in C/C++, supports Linux, FreeBSD and the Windows family of operating systems, supports mainstream development languages such as C/C++, C#, Delphi and Java, and is currently one of the best Chinese lexical analyzers. The development interface consists of two parts, the HLSSplit.dll program interface and the HLSSplit.dll.dat corpus, and can be obtained free of charge over the Internet.
This embodiment calls HLSSplit.dll of the Hailiang segmenter, with the HLSSplit.dll.dat corpus, to segment the text and obtain the Chinese words, part-of-speech tags, word positions and word frequency statistics. The segmentation result is used for spam comment filtering.
(3) Cluster-based spam comment filtering
Many of the matched replies are meaningless, such as advertisements, irrelevant replies and repeated replies. To mine opinion leaders effectively, the analysis should focus on the meaningful replies that repliers give to posters, from which the repliers' support for the posters is derived. Spam comments must therefore be filtered out.
The main steps of cluster-based spam comment filtering, shown in Fig. 3, are:
1) For the comment set obtained, represent the features of each comment with the vector space model, so that each comment is expressed as a vector in the space. In the vector space model, a text d_i is regarded as an n-dimensional vector formed by a set of features (t_1, t_2, ..., t_n). The text d_i is thus reduced to a vector whose components are the feature weights, where the weight w_ij expresses the importance of feature t_j (j = 1, 2, ..., n) to the text d_i, so that d_i = (w_i1, w_i2, ..., w_in).
Here t_j is a feature item, which may be a character, a word or a phrase, and w_ij is the weight of t_j, expressing its importance in the comment. The widely used TF-IDF weighting method is adopted here.
2) Generate the similarity matrix. The similarity matrix contains the pairwise similarities of the comments. After each comment has been represented as a vector, the cosine formula is used to compute the similarity Sim(d_i, d_j) between two comments d_i and d_j. The computation is as follows:
Similarity computation
Input: two text vectors d_i and d_j
Output: similarity matrix G
The similarity of two documents is the cosine of the angle between their vectors, i.e. Sim(d_i, d_j) = (d_i · d_j) / (|d_i| * |d_j|).
3) Construct the unnormalized Laplacian matrix as the sample matrix. First derive the adjacency matrix W from the similarity matrix G, then sum the elements of each column of W to obtain N numbers and place them on the diagonal (all other entries zero) to form an N*N matrix, denoted D. Let L = D - W; L is the sample matrix.
4) Construct the eigenvector space. Compute the first k eigenvalues of L and the corresponding eigenvectors (unless otherwise stated, "the first k" means in ascending order of eigenvalue), and form an N*k matrix from these k eigenvectors; this matrix is the eigenvector space.
5) Arrange the k eigenvectors as columns to form an N*k matrix, treat each row as a vector in k-dimensional space, and cluster the rows with the spectral clustering algorithm.
6) On the basis of the spectral clustering, compute the distance from each point to the center of its cluster, using the Euclidean distance. The distance ρ(A, B) between two points A = (a[1], a[2], ..., a[n]) and B = (b[1], b[2], ..., b[n]) is defined as:
ρ(A,B) = sqrt[ ∑ (a[i] - b[i])^2 ] (i = 1, 2, …, n)
7) Compute the outlier degree of each object. Take the mean E(Xi) and the variance E(Xi-E(Xi))^2 of the above distances as the basic data, and compute the outlier degree with the formula Out(i) = E(Xi)/E(Xi-E(Xi))^2.
8) Detect spam comments with the outlier degree. In the detection method based on spectral clustering, spam comments are treated as outliers, so outlier detection only requires a Top-n ranking by outlier degree; the objects with the highest outlier degree are the outliers, i.e. the detected spam comments.
Once a comment has been determined to be spam, its record is deleted from the database.
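Steps 1) to 3) and the outlier degree of step 7) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the embodiment's implementation: the toy token lists are invented, the TF-IDF variant (term frequency times log(N/df)) is one common choice that the text does not pin down, the similarity matrix is used directly as the adjacency matrix with a zero diagonal, and `outlier_degree` assumes that the distances belonging to object i are passed in as a list, which is one reading of the formula. The eigen-decomposition and clustering of steps 4) to 6) are omitted; they would normally be done with a numerical library.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one sparse {term: weight} dict per doc."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Sim(d_i, d_j): cosine of the angle between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def laplacian(vectors):
    """Unnormalized Laplacian L = D - W built from the cosine similarities."""
    n = len(vectors)
    # Assumption: adjacency W is the similarity matrix with a zero diagonal.
    W = [[0.0 if i == j else cosine(vectors[i], vectors[j]) for j in range(n)]
         for i in range(n)]
    degree = [sum(W[i][j] for i in range(n)) for j in range(n)]   # column sums
    return [[(degree[i] if i == j else 0.0) - W[i][j] for j in range(n)]
            for i in range(n)]


def outlier_degree(distances):
    """Out(i) = mean / variance of the distances associated with object i."""
    mean = sum(distances) / len(distances)
    variance = sum((x - mean) ** 2 for x in distances) / len(distances)
    return mean / variance if variance else float("inf")


# Toy, already segmented comments; the last one shares no term with the others.
docs = [
    ["market", "policy", "growth"],
    ["market", "policy", "risk"],
    ["market", "growth", "risk"],
    ["cheap", "watches", "sale"],
]
L = laplacian(tfidf_vectors(docs))
```

In this toy data the last comment has zero similarity to every other comment, so its row of L is all zeros: it is disconnected from the rest of the graph, which is the kind of isolation the clustering and outlier steps are meant to expose.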
3. Analyzing text orientation
When analyzing the relationships between users, beyond the basic connection relationships, the core task is to compute how strongly one user supports another. Classifying the sentiment orientation of post text improves the accuracy of this support analysis. Text orientation analysis mines and analyzes subjective information in the text, such as opinions and preferences, and classifies the sentiment orientation of the text.
Many methods of text orientation analysis exist. The present invention has already performed Chinese word segmentation and spam comment filtering during text preprocessing. Given the characteristics of opinion leader mining, the accuracy required of the orientation classification is relatively low, while the efficiency required of the algorithm is high. The present invention therefore performs orientation classification with a sentiment corpus, namely the sentiment analysis word set in HowNet201104, which is freely available on the Internet. The Chinese words in it are selected to build a polarity dictionary; the sizes of the resulting word sets are shown in Fig. 5.
The forum sentiment corpus thus built gives a quantified score for each sentiment word. For example, "good" scores 1, "very good" scores 2, "poor" scores -1 and "very poor" scores -2.
Once the corpus is built, every comment can be given an orientation score. Take the comment "The original poster's analysis is very good, but a bit extreme": "very good" and "extreme" both express the replier's sentiment; "very good" scores 2 and "extreme" scores -1, so the weighted sum gives the comment a score of 1. The score is then written into the "orientation score" field of the comment's record in the database.
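The scoring rule in this example can be written out in a few lines of Python. This is a minimal sketch under stated assumptions: the lexicon holds only the example values quoted above, with English glosses as keys (the phrase scored 2 appears as "very good"), the comment is assumed to be already segmented, and every word has weight 1 because the text does not say which weights the weighted sum uses. The names `LEXICON` and `tendency_score` are hypothetical.

```python
# Hypothetical polarity lexicon holding only the example scores from the text.
LEXICON = {"good": 1, "very good": 2, "poor": -1, "very poor": -2, "extreme": -1}


def tendency_score(tokens, lexicon=LEXICON, weights=None):
    """Weighted sum of the polarity scores of the sentiment words in a comment."""
    weights = weights or {}   # assumption: unit weight unless stated otherwise
    return sum(lexicon[t] * weights.get(t, 1) for t in tokens if t in lexicon)


# "The original poster's analysis is very good, but a bit extreme", segmented.
tokens = ["original poster", "analysis", "very good", "but", "a bit", "extreme"]
score = tendency_score(tokens)   # 2 + (-1)
```

Tokens that are not in the lexicon contribute nothing, so neutral words do not affect the score.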
4. Mining opinion leaders
The traditional PageRank algorithm:
PageRank relies on users' support for sites, using the large-scale link structure to indicate the value of an individual page. It works like a vote cast by all other pages on the Internet that decides the importance of a page. A link pointing to a page counts as a vote of support; if no link points to it, the page has no supporting votes.
The PageRank value is defined as follows. Assume that pages T1, ..., Tn point to page A (that is, T1, ..., Tn cite page A). The parameter d is a damping factor set between 0 and 1. In addition, C(A) is defined as the number of links going out of page A. The PageRank value of page A is then given by:
PR(A) = (1-d) + d*(PR(T1)/C(T1) + … + PR(Tn)/C(Tn)),
where PR(A) is the score of page A; d is the damping factor, usually set to 0.8; PR(Tn) is the score of a page Tn that points to page A; C(Tn) is the number of out-links of page Tn; and PR(Tn)/C(Tn) is the share of the score of Tn that is passed along each of its out-links. PageRank values form a probability distribution over the whole set of pages, so the PageRank values of all pages sum to 1.
The score that an external link from page B brings to page A is inversely proportional to the number of out-links of B: as the number of out-links on B grows, the score passed to A shrinks. In other words, a page's score is the basic measure of its vote for other pages. A page can vote for one or more outgoing links, but its total voting power is fixed and is divided equally among them. Suppose B has a score of 5 and carries only one link, pointing to A: A then receives the whole of B's score, and B loses nothing. If B carries n links, however, A receives only 1/n of B's score.
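The traditional formula can be iterated directly, as in the following Python sketch. It is an illustration rather than the embodiment's code: the `pagerank` function, the `out_links` mapping from each page to the pages it links to, the toy graph and the starting value of 1.0 are assumptions; d = 0.8 and the convergence error 0.001 are the values quoted in this document.

```python
def pagerank(out_links, d=0.8, tol=0.001, max_iter=100):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over pages T linking to A."""
    pr = {page: 1.0 for page in out_links}   # assumed starting scores
    for _ in range(max_iter):
        new = {}
        for page in out_links:
            incoming = [t for t, targets in out_links.items() if page in targets]
            new[page] = (1 - d) + d * sum(pr[t] / len(out_links[t]) for t in incoming)
        delta = max(abs(new[p] - pr[p]) for p in out_links)
        pr = new
        if delta < tol:   # stop once no score moved by more than the error bound
            break
    return pr


# Toy graph: A and B link to each other; C links to both A and B.
links = {"A": ["B"], "B": ["A"], "C": ["A", "B"]}
scores = pagerank(links)
```

In the toy graph C has two out-links, so each of A and B receives half of C's score, while A and B pass their whole score to each other, which mirrors the 1/n rule described above.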
The improved PageRank algorithm:
By definition, the traditional PageRank algorithm assumes that, for every page Ti linking to page A, the probability of jumping from Ti to A along a link is the reciprocal of the out-degree of Ti, i.e. that Ti jumps to each of its linked pages with equal probability. This does not match reality. In a forum, a user can have many out-links and can support posts by many people, but the weights of these links certainly differ and depend on how closely the replier and the poster are related. This weight can be described by the link relevance, expressed as follows:
L(A,Ti) = (Ua ∩ Ui)/(Ua ∪ Ui),
where L(A,Ti) is the link relevance between page A and page Ti; Ua is the set of URLs consisting of the out-links and in-links of page A and page A itself; and Ui is the set of URLs consisting of the out-links and in-links of page Ti and page Ti itself. In a forum, this method is equivalent to treating the original directed network as an undirected one. It expresses the ratio of the number of posts by other posters that both the poster and the replier have replied to, to the total number of posts published by the poster and the replier; the larger the ratio, the stronger the correlation between the two.
This gives the revised formula:
PR(A) = (1-d) + d*(PR(T1)*L(A,T1) + … + PR(Tn)*L(A,Tn)),
where PR(A) is the score of page A; d is the damping factor, set to 0.8; PR(Tn) is the score of a page Tn that points to page A; and L(A,Tn) is the link relevance between page A and page Tn.
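The revised formula can be sketched in Python as follows. This is an illustration under stated assumptions, not the embodiment's code: nodes are forum users, `out_links` is a hypothetical mapping from each user to the users they replied to, the link relevance is computed as the size of the intersection of the two sets divided by the size of their union, and the starting value of 1.0 is an arbitrary choice.

```python
def neighbourhood(node, out_links):
    """The node itself plus its out-links and in-links (the set Ua in the text)."""
    incoming = {t for t, targets in out_links.items() if node in targets}
    return {node} | set(out_links[node]) | incoming


def link_relevance(a, t, out_links):
    """L(A, T): shared members of the two sets over all members of both."""
    ua, ut = neighbourhood(a, out_links), neighbourhood(t, out_links)
    return len(ua & ut) / len(ua | ut)


def modified_pagerank(out_links, d=0.8, tol=0.001, max_iter=100):
    """Iterate PR(A) = (1-d) + d * sum(PR(T) * L(A, T)) over T linking to A."""
    pr = {node: 1.0 for node in out_links}   # assumed starting scores
    for _ in range(max_iter):
        new = {}
        for a in out_links:
            incoming = [t for t, targets in out_links.items() if a in targets]
            new[a] = (1 - d) + d * sum(
                pr[t] * link_relevance(a, t, out_links) for t in incoming)
        delta = max(abs(new[n] - pr[n]) for n in out_links)
        pr = new
        if delta < tol:
            break
    return pr


# Toy forum: three users reply to "lead"; u3 also replies to u1.
replies_to = {"lead": [], "u1": ["lead"], "u2": ["lead"], "u3": ["lead", "u1"]}
scores = modified_pagerank(replies_to)
top_user = max(scores, key=scores.get)
```

Scores computed this way are not normalized. To compare them with a threshold such as 0.001, they would presumably have to be put on the same scale first (for example divided by their total); the text does not spell this out.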
Improved Pagerank algorithm points for attention:
Programme according to algorithm and just can calculate the score of user in the forum, score the higher person just can be regarded as the leader of opinion.But should be noted that following problem in this algorithmic procedure concrete the use:
(1) Setting the iteration convergence error
The PageRank algorithm is an iteratively converging algorithm, so a convergence error must be determined. Setting it takes some care: if it is too large, the accuracy of the result is doubtful; if it is too small, obtaining a result takes too long, and sometimes no result is obtained within the observation time. After repeated tests, the convergence error is set to 0.001.
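The role of the convergence error can be illustrated with a short sketch of the revised formula iterated until scores stop changing. Several points are assumptions not stated in the text: the error is measured as the largest absolute change in any score between two iterations, every score starts at 1.0, and the names `revised_pagerank`, `in_links`, `corr`, and `max_iter` are hypothetical. The sample graph and correlation values are invented.

```python
def revised_pagerank(in_links, corr, d=0.8, tol=0.001, max_iter=1000):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) * L(A, T)) until stable.

    in_links[a] lists the nodes T that point to a;
    corr[(a, t)] holds the link correlation degree L(a, t).
    """
    nodes = list(in_links)
    pr = {a: 1.0 for a in nodes}  # assumed starting score
    for iteration in range(1, max_iter + 1):
        new_pr = {
            a: (1 - d) + d * sum(pr[t] * corr[(a, t)] for t in in_links[a])
            for a in nodes
        }
        # Assumed error measure: largest change in any single score.
        delta = max(abs(new_pr[a] - pr[a]) for a in nodes)
        pr = new_pr
        if delta < tol:
            return pr, iteration
    raise RuntimeError("scores did not converge within max_iter iterations")


# Hypothetical forum: B and C replied to A, and A replied to B.
in_links = {"A": ["B", "C"], "B": ["A"], "C": []}
corr = {("A", "B"): 0.5, ("A", "C"): 0.25, ("B", "A"): 0.5}
pr, iterations = revised_pagerank(in_links, corr)
```

A larger `tol` stops earlier with a rougher result, while a smaller one needs more iterations; the `max_iter` guard stands in for the bounded observation time mentioned above.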
(2) Opinion leader selection standard
The standard referred to here is a numerical value: the score above which a user is considered an opinion leader. A unified standard is difficult to obtain. The Tianya Economic Forum under investigation has more than 20,000 users in total, and the scores of all users sum to 1. After repeated tests, 0.001 was found to be a reasonable standard value; analysis showed that the opinion leaders selected with this value all fit the characterization of an opinion leader.
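Applying the selection standard amounts to a threshold filter, sketched below. The text says the scores of all users sum to 1 but does not say how that is achieved; dividing each raw score by the total is an assumption made here. The function name `select_opinion_leaders` and the sample scores are hypothetical.

```python
def select_opinion_leaders(scores, threshold=0.001):
    """Return users whose share of the total score exceeds the threshold."""
    total = sum(scores.values())
    if total == 0:
        return []
    # Assumption: scores are rescaled so that they sum to 1.
    normalized = {user: s / total for user, s in scores.items()}
    leaders = [user for user, s in normalized.items() if s > threshold]
    return sorted(leaders, key=lambda user: normalized[user], reverse=True)


# Hypothetical raw scores: 2000 ordinary users and two prominent ones.
scores = {f"user{i}": 0.2 for i in range(2000)}
scores["alice"] = 3.0
scores["bob"] = 1.5
leaders = select_opinion_leaders(scores)
```

With this data each ordinary user holds about 0.0005 of the total and falls below the standard, while the two prominent users exceed it.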
(3) Handling the rank sink phenomenon
When the algorithm is applied, several isolated nodes may vote for one another. Suppose user A and user B each support the other, and neither obtains support from any other user in the forum. During computation, the scores of both stabilize at a high value or even tend to increase, so the final scores may exceed the selection standard and users A and B are mistaken for opinion leaders. This phenomenon is called rank sink. When rank sink is encountered, the isolated nodes need to be rejected: in the implementation, if a user read in is one considered to cause rank sink, that user is skipped directly and does not enter the algorithm matrix, which avoids rank sink.
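The two-user case described above can be detected with a simple check, sketched here. The text does not specify how rank-sink users are identified, so this rule (two users whose only supporter is each other) is an assumption, and it does not cover larger groups of mutually voting users. The names `find_mutual_pairs`, `drop_users`, and `in_links` are hypothetical.

```python
def find_mutual_pairs(in_links):
    """Return users whose only supporter is one user who is supported only by them.

    in_links[u] is the set of users who support (reply positively to) u.
    """
    isolated = set()
    for user, supporters in in_links.items():
        if len(supporters) == 1:
            (other,) = supporters
            if other != user and in_links.get(other, set()) == {user}:
                isolated.update((user, other))
    return isolated


def drop_users(in_links, excluded):
    """Remove excluded users and their votes before building the matrix."""
    return {
        user: {s for s in supporters if s not in excluded}
        for user, supporters in in_links.items()
        if user not in excluded
    }


# Hypothetical forum: A and B only support each other; C, D, E interact normally.
in_links = {"A": {"B"}, "B": {"A"}, "C": {"D", "E"}, "D": {"C"}, "E": set()}
sink = find_mutual_pairs(in_links)
cleaned = drop_users(in_links, sink)
```

Here D is supported only by C, but C also receives support from E, so the pair C and D is not flagged.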
Concrete application of the improved PageRank algorithm:
Before the algorithm is executed to mine opinion leaders, a matrix is first constructed. Supposing the forum has n users, an n × n matrix is constructed in which each index corresponds to one forum user, and the value of each point (i, j) is the arithmetic sum of the tendency scores of the content of each reply by user j to user i.
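The matrix construction can be sketched as follows. The tuple layout `(replier, poster, sentiment)`, the function name `build_support_matrix`, and the sample tendency scores are hypothetical; the text does not give the scale of the tendency score, so the values used here are illustrative only.

```python
def build_support_matrix(users, replies):
    """Build an n x n matrix where entry (i, j) sums user j's reply scores to user i.

    replies is an iterable of (replier, poster, sentiment) tuples.
    """
    index = {user: k for k, user in enumerate(users)}
    n = len(users)
    matrix = [[0.0] * n for _ in range(n)]
    for replier, poster, sentiment in replies:
        i, j = index[poster], index[replier]
        matrix[i][j] += sentiment  # arithmetic sum over all replies
    return matrix


users = ["u1", "u2", "u3"]
replies = [
    ("u2", "u1", 1.0),   # u2 supports a post by u1
    ("u2", "u1", 0.5),   # a second, milder supportive reply
    ("u3", "u1", -1.0),  # u3 opposes a post by u1
    ("u1", "u3", 1.0),   # u1 supports a post by u3
]
matrix = build_support_matrix(users, replies)
```

Row i therefore collects all the support and opposition that user i received, one column per replier.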
Next, the iteration convergence error and the opinion leader selection standard of the improved algorithm are set; the present invention sets this standard value to 0.001. After iterative computation, every user whose PR(A) value is greater than the standard value is an opinion leader in the forum.
The mined opinion leaders are then checked for misjudgment caused by rank sink. If a misjudgment exists, the user concerned is deleted and the improved PageRank algorithm is re-executed until a correct result is obtained.
5. Visualizing the mining results
The visual display is implemented with the open-source social network visualization tool Vizster. Vizster is an interactive visualization tool for online social networks and can be downloaded from http://hci.stanford.edu/jheer/projects/vizster/download.html. The relationships between users are organized in XML form as the input to Vizster. The present embodiment partially modifies Vizster, simplifying the information display area to show the names of users who reply to each other, their PageRank values, and whether each is an opinion leader. In addition, because too many forum users would impair the display, only the mined opinion leaders and the relatively active users may be shown, so that Vizster achieves a concise and clear display without losing any useful information.
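Organizing the user relationships as XML can be sketched with Python's standard library. The text does not give the schema that Vizster expects, so the element and attribute names used here (`graph`, `node`, `edge`, `pagerank`, `leader`) are purely hypothetical, as are the function name `users_to_xml` and the sample data.

```python
import xml.etree.ElementTree as ET


def users_to_xml(users, replies):
    """Serialize users and reply relations to an XML string.

    users maps a name to (pagerank_score, is_leader);
    replies is an iterable of (replier, poster) pairs.
    """
    root = ET.Element("graph")  # hypothetical root element
    for name, (score, is_leader) in users.items():
        ET.SubElement(
            root,
            "node",
            id=name,
            pagerank=f"{score:.4f}",
            leader=str(is_leader).lower(),
        )
    for replier, poster in replies:
        ET.SubElement(root, "edge", source=replier, target=poster)
    return ET.tostring(root, encoding="unicode")


xml_text = users_to_xml(
    {"alice": (0.0074, True), "bob": (0.0004, False)},
    [("bob", "alice")],
)
```

Restricting the `users` argument to opinion leaders and relatively active users before serialization would correspond to the reduced display described above.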
Because Vizster supports functions such as visual search, analysis, and automatic association of nodes, the influence structure diagram of all forum users can easily be viewed through the system, and opinion leaders can be found quickly by their distinguishing color.

Claims (7)

1. An opinion leader mining method for network forums, involving an opinion leader mining system that comprises a computing center and a database server, the database server communicating with the computing center, characterized in that the concrete steps of the method are:
1) capturing forum data with a crawler, and using message-oriented middleware to improve the real-time performance of data processing;
2) extracting webpage information, performing word segmentation with a Chinese word segmentation system, and filtering spam comments with a spectral clustering method;
3) performing text tendency analysis with an emotional corpus;
4) setting an opinion leader selection standard value, and determining opinion leaders with the following formula:
PR(A) = (1 - d) + d * (PR(T1) * L(A,T1) + … + PR(Tn) * L(A,Tn)),
wherein: PR(A) denotes the score of the given page A; d is the damping factor; PR(Tn) denotes the score of a page Tn that points to page A; L(A, Tn) denotes the link correlation degree between webpage A and webpage Tn, L(A, Tn) = |Ua ∩ Un| / |Ua ∪ Un|; Ua denotes the set of URLs made up of the out-links of webpage A, its in-links, and A itself; Un denotes the set of URLs made up of the out-links of webpage Tn, its in-links, and Tn itself;
a user is determined to be an opinion leader when PR(A) is greater than the set selection standard value;
5) visualizing the result of step 4).
2. The opinion leader mining method for network forums according to claim 1, characterized in that, in said step 1), the open-source crawler Netcrawler is used to collect the network forum data, and said message-oriented middleware is ActiveMQ.
3. The opinion leader mining method for network forums according to claim 1, characterized in that, in said step 2), regular expressions are used to extract the webpage information, and said Chinese word segmentation system is ICTCLAS, a Chinese lexical analysis system based on a multi-layer hidden Markov model.
4. The opinion leader mining method for network forums according to claim 1, characterized in that, in said step 2), the steps of filtering spam comments with the spectral clustering method are:
1) collecting the text intelligence corpus and preprocessing the text to obtain a comment set;
2) for the comment set obtained, representing the features of each comment with a vector space model, so that each comment is expressed as a vector in space;
3) generating a similarity matrix G;
4) constructing the unnormalized Laplacian matrix as the sample matrix: deriving an adjacency matrix W from the similarity matrix G, then summing the elements of each column of the adjacency matrix to obtain N numbers, placing them on the diagonal with zeros everywhere else to form an N × N matrix denoted D, and letting L = D - W, where L is the sample matrix;
5) constructing the eigenvector space: computing the first k eigenvalues of L and the corresponding eigenvectors, the first k eigenvalues being taken in ascending order of eigenvalue, these k eigenvectors forming an N × k matrix that is the eigenvector space;
6) arranging these k eigenvectors as columns to form an N × k matrix, regarding each row as a vector in k-dimensional space, and clustering with the spectral clustering algorithm;
7) using the Euclidean distance method to calculate, on the basis of the spectral clustering, the distance from each point to its corresponding cluster center;
8) calculating the outlier degree of each object: taking the ratio of the mean E(Xi) of the above distances to their variance E((Xi - E(Xi))^2) as the basic data of the outlier degree, and then calculating the outlier degree according to the formula Out(i) = E(Xi) / E((Xi - E(Xi))^2);
9) using the outlier degree to detect spam comments: spam comments are treated as outliers and outlier detection is performed; it suffices to sort by outlier degree and take the top n, the objects with the highest outlier degree being the outliers, i.e. the detected spam comments;
10) deleting the detected spam comments from the database.
5. The opinion leader mining method for network forums according to claim 1, characterized in that, in said step 3), the emotional corpus used is the sentiment analysis word set in HowNet201104.
6. The opinion leader mining method for network forums according to claim 1, characterized in that, in said step 4), the selection standard value is 0.001 and the damping factor d is set to 0.8.
7. The opinion leader mining method for network forums according to claim 1, characterized in that, in said step 5), Vizster is used to implement the visualization.
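The construction of the sample matrix in step 4) of claim 4 can be illustrated with a small sketch in plain Python. It takes the adjacency matrix W as given, since the claim does not say how W is derived from the similarity matrix G; the function name `unnormalized_laplacian` and the sample weights are hypothetical.

```python
def unnormalized_laplacian(w):
    """Return L = D - W, where D holds the column sums of W on its diagonal."""
    n = len(w)
    # Column sums of the adjacency matrix give the N diagonal entries of D.
    degrees = [sum(w[row][col] for row in range(n)) for col in range(n)]
    return [
        [(degrees[i] if i == j else 0.0) - w[i][j] for j in range(n)]
        for i in range(n)
    ]


# Hypothetical symmetric adjacency matrix for three comments.
w = [
    [0.0, 1.0, 0.5],
    [1.0, 0.0, 0.0],
    [0.5, 0.0, 0.0],
]
lap = unnormalized_laplacian(w)
```

For a symmetric W, every row of the resulting matrix sums to zero, which is a quick sanity check before computing the eigenvalues required in step 5).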
CN2012102501161A 2012-07-19 2012-07-19 Internet forum-oriented opinion leader mining method Pending CN102890702A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2012102501161A CN102890702A (en) 2012-07-19 2012-07-19 Internet forum-oriented opinion leader mining method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2012102501161A CN102890702A (en) 2012-07-19 2012-07-19 Internet forum-oriented opinion leader mining method

Publications (1)

Publication Number Publication Date
CN102890702A true CN102890702A (en) 2013-01-23

Family

ID=47534204

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2012102501161A Pending CN102890702A (en) 2012-07-19 2012-07-19 Internet forum-oriented opinion leader mining method

Country Status (1)

Country Link
CN (1) CN102890702A (en)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040073476A1 (en) * 2002-10-10 2004-04-15 Prolink Services Llc Method and system for identifying key opinion leaders

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
吴令飞: "寻找"意见领袖":应用Page Rank算法处理社会网络数据的尝试", 《北京大学硕士学位论文》, 31 December 2009 (2009-12-31) *
葛斌等: "网络论坛意见领袖挖掘系统设计与实现", 《电脑知识与技术》, vol. 7, no. 22, 31 August 2011 (2011-08-31), pages 5393 - 5395 *
钟洵: "谱聚类在离群数据挖掘中的应用", 《中国优秀硕士学位论文全文数据库信息科技辑》, no. 3, 15 March 2011 (2011-03-15) *

Cited By (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103150333B (en) * 2013-01-26 2016-01-13 安徽博约信息科技有限责任公司 Opinion leader identification method in microblog media
CN103150333A (en) * 2013-01-26 2013-06-12 安徽博约信息科技有限责任公司 Opinion leader identification method in microblog media
CN103279484A (en) * 2013-04-23 2013-09-04 中国科学院计算技术研究所 Creating method and creating system facing future opinion leaders in micro-blog system
CN103279484B (en) * 2013-04-23 2016-03-30 中国科学院计算技术研究所 The creation method of a kind of following leader of opinion in micro blog system and system
CN104142948A (en) * 2013-05-09 2014-11-12 富士通株式会社 Method and equipment for mining domain review leader
CN104239373A (en) * 2013-06-24 2014-12-24 腾讯科技(深圳)有限公司 Document tag adding method and document tag adding device
CN103646097B (en) * 2013-12-18 2016-09-07 北京理工大学 A kind of suggestion target based on restriction relation and emotion word associating clustering method
CN103646097A (en) * 2013-12-18 2014-03-19 北京理工大学 Constraint relationship based opinion objective and emotion word united clustering method
CN104750699A (en) * 2013-12-25 2015-07-01 伊姆西公司 Comment data management method and advice
US10614089B2 (en) 2013-12-25 2020-04-07 EMC IP Holding Company LLC Managing opinion data
CN104750699B (en) * 2013-12-25 2019-05-03 伊姆西公司 Method and apparatus for managing opinion data
CN105630801A (en) * 2014-10-30 2016-06-01 国际商业机器公司 Method and apparatus for detecting deviated user
CN104462253A (en) * 2014-11-20 2015-03-25 武汉数为科技有限公司 Topic detection or tracking method for network text big data
CN104462253B (en) * 2014-11-20 2018-05-18 武汉数为科技有限公司 A kind of topic detection or tracking of network-oriented text big data
CN104866572A (en) * 2015-05-22 2015-08-26 齐鲁工业大学 Method for clustering network-based short texts
CN104866572B (en) * 2015-05-22 2018-05-18 齐鲁工业大学 A kind of network short text clustering method
CN106354843A (en) * 2016-08-31 2017-01-25 虎扑(上海)文化传播股份有限公司 Web crawler system and method
CN107145897B (en) * 2017-03-14 2020-01-07 中国科学院计算技术研究所 Evolution network special group mining method and system based on communication space-time characteristics
CN107145897A (en) * 2017-03-14 2017-09-08 中国科学院计算技术研究所 A kind of differentiation network specific group's method for digging and system based on communication space-time characteristic
CN107633260B (en) * 2017-08-23 2020-10-16 上海师范大学 Social network opinion leader mining method based on clustering
CN107633260A (en) * 2017-08-23 2018-01-26 上海师范大学 A kind of social network opinion leader method for digging based on cluster
CN107391775A (en) * 2017-08-28 2017-11-24 湖北省楚天云有限公司 A kind of general web crawlers model implementation method and system
CN108009726A (en) * 2017-12-04 2018-05-08 上海财经大学 A kind of things evaluation system of combination user comment
CN108009727A (en) * 2017-12-04 2018-05-08 上海财经大学 A kind of things evaluation method of combination user comment
CN108009726B (en) * 2017-12-04 2021-12-28 上海财经大学 Object evaluation system combining user comments
CN109815395A (en) * 2018-12-26 2019-05-28 北京中科闻歌科技股份有限公司 Webpage garbage information filtering method, device and storage medium
CN109815395B (en) * 2018-12-26 2021-06-08 北京中科闻歌科技股份有限公司 Webpage spam filtering method and device and storage medium
CN110110084A (en) * 2019-04-23 2019-08-09 北京科技大学 The recognition methods of high quality user-generated content
CN110489658A (en) * 2019-07-12 2019-11-22 北京邮电大学 Online social network opinion leader method for digging based on digraph model
CN111460317A (en) * 2020-03-30 2020-07-28 北京百分点信息科技有限公司 Opinion leader identification method, device and equipment
CN111831881A (en) * 2020-07-04 2020-10-27 西安交通大学 Malicious crawler detection method based on website traffic log data and optimized spectral clustering algorithm
CN111831881B (en) * 2020-07-04 2023-03-21 西安交通大学 Malicious crawler detection method based on website traffic log data and optimized spectral clustering algorithm
CN112116473A (en) * 2020-09-18 2020-12-22 上海计算机软件技术开发中心 Cross-chain notary mechanism evaluation system and platform
CN114443902A (en) * 2022-02-22 2022-05-06 广州云智达创科技有限公司 Person-to-person analysis method, person-to-person analysis device, storage medium and program product

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20130123