CN102890702A - Internet forum-oriented opinion leader mining method - Google Patents
- Publication number: CN102890702A (other identifiers: CN2012102501161A, CN201210250116A)
- Authority: CN (China)
- Prior art keywords: opinion, leader, forum, comment, utilize
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abstract
The invention discloses an opinion leader mining method for Internet forums. The method involves an opinion leader mining system comprising a computing center and a database server that communicates with the computing center. The method comprises the following steps: capturing forum data with a crawler and improving the real-time performance of data processing with message-oriented middleware (MOM); extracting web page information, performing word segmentation with a Chinese word segmentation system, and filtering spam comments with a spectral clustering method; analyzing text orientation with an emotional corpus; setting a selection standard value for opinion leaders and determining the opinion leaders; and visualizing the result. The method can accurately mine the opinion leaders in a forum and provides technical support for the relevant Internet public opinion supervision departments to discover hot issues in time and guide the healthy development of online public opinion.
Description
Technical field
The present invention relates to the field of Internet information management, and in particular to an opinion leader mining method for Internet forums.
Background technology
With the rapid development of network technology and the rapid growth of the online population, more and more members of the public take part in social discussion and express their views over the network. Because the Internet allows communication on an equal footing and broad participation, many domestic and international hot events quickly generate enormous pressure from online public opinion, and the network has become one of the main carriers reflecting public sentiment.
In the formation of online public opinion, opinion leaders play a marked catalytic role. An opinion leader is a person who can put forward guiding opinions and has broad social influence. Opinion leaders have built up high popularity in Internet forums. While a public opinion event is brewing and fermenting, participants are more easily influenced by opinion leaders, whose statements and views tend to influence and change the views of others and to steer and drive the further development of events. Their effect on the emergence, development, and extinction of online public opinion may be positive or negative, so mining opinion leaders is of real practical significance. However, with Internet data growing explosively, traditional methods that find opinion leaders by relying on complicated statistical data are no longer adequate.
Summary of the invention
The technical problem to be solved by the present invention is, in view of the deficiencies of the prior art, to provide an opinion leader mining method for Internet forums that accurately mines the opinion leaders in a forum and provides technical support for the relevant Internet public opinion supervision departments to discover hot issues in time and guide the healthy development of online public opinion.
To solve the above technical problem, the technical solution adopted by the present invention is an opinion leader mining method for Internet forums. The method involves an opinion leader mining system comprising a computing center and a database server, the database server communicating with the computing center. The specific steps of the method are:
1) capture forum data with a crawler, and use message-oriented middleware to improve the real-time performance of data processing;
2) extract web page information, perform word segmentation with a Chinese word segmentation system, and filter spam comments with a spectral clustering method;
3) analyze text orientation with an emotional corpus;
4) set the opinion leader selection standard value, and determine the opinion leaders with the following formula:
PR(A) = (1 - d) + d * (PR(T1) * L(A, T1) + … + PR(Tn) * L(A, Tn)),
where PR(A) is the score of the given page A; d is the damping factor; PR(Tn) is the score of a page Tn that points to page A; L(A, Tn) is the link relevance of page A and page Tn, with L(A, Tn) = |Ua ∩ Un| / |Ua ∪ Un|; Ua is the set of URLs consisting of page A's outgoing links, incoming links, and A itself; and Un is the corresponding set for page Tn.
When PR(A) is greater than the set selection standard value, A is determined to be an opinion leader;
5) visualize the result of step 4).
As a preferred scheme, in step 1), the open-source crawler Netcrawler is used to collect the forum data, and the message-oriented middleware is ActiveMQ.
As a preferred scheme, in step 2), regular expressions are used to extract the web page information, and the Chinese word segmentation system is ICTCLAS, a Chinese lexical analysis system based on a multilayer hidden Markov model.
In step 2), the steps for filtering spam comments with the spectral clustering method are:
1) collect the text corpus and preprocess the text to obtain a comment set;
2) for the comment set obtained, represent the features of each comment with the vector space model, so that each comment is expressed as a vector in the space;
3) generate the similarity matrix G;
4) construct the unnormalized Laplacian matrix as the sample matrix: derive the adjacency matrix W from the similarity matrix G, sum the elements of each column of W to obtain N numbers, place them on the diagonal with zeros everywhere else to form an N*N matrix denoted D, and let L = D - W; L is the sample matrix;
5) construct the eigenvector space: compute the first k eigenvalues of L and the corresponding eigenvectors, and form an N*k matrix from these k eigenvectors, which is the eigenvector space; the first k eigenvalues are taken in ascending order of eigenvalue;
6) arrange these k eigenvectors as columns to form an N*k matrix, regard each row as a vector in k-dimensional space, and cluster the rows using spectral clustering;
7) using the Euclidean distance, compute the distance from each point to its corresponding class center on the basis of the spectral clustering;
8) compute the outlier degree of each object: take the ratio of the mean E(Xi) of the above distances to their variance E[(Xi - E(Xi))^2] as the basic data, and compute the outlier degree by the formula Out(i) = E(Xi) / E[(Xi - E(Xi))^2];
9) detect spam comments using the outlier degree: treat spam comments as outliers and perform outlier detection by Top-n ranking of the outlier degree; the objects with the highest outlier degree are the outliers, that is, the detected spam comments;
10) delete the detected spam comments from the database.
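Steps 7) to 9) can be illustrated with a minimal Python sketch. It assumes cluster labels are already available from the clustering step, and it ranks objects by their raw Euclidean distance to the class center as a stand-in score, because the text does not pin down the sample Xi over which Out(i) is computed. The function names, points, and labels are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_n_outliers(points, labels, n):
    """Return indices of the n points farthest from their own class center.

    Assumption: distance to the class center is used directly as the
    ranking score (a simplification of the Out(i) formula in the text).
    """
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    # Class center = coordinate-wise mean of the cluster members.
    centres = {lab: [sum(c) / len(ps) for c in zip(*ps)]
               for lab, ps in clusters.items()}
    scores = [euclidean(p, centres[lab]) for p, lab in zip(points, labels)]
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]

# Hypothetical embedded comments (rows of the N*k matrix) and cluster labels.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (3.0, 3.0), (5.0, 5.0), (5.1, 5.0)]
labels = [0, 0, 0, 0, 1, 1]
spam = top_n_outliers(points, labels, 1)
```

Here the fourth point sits far from the rest of its cluster, so it is the single object flagged for deletion.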
As a preferred scheme, in step 3), the emotional corpus used is the sentiment analysis word set in HowNet201104.
As a preferred scheme, in step 4), the selection standard value is 0.001 and the damping factor d is set to 0.8.
As a preferred scheme, in step 5), Vizster is used for visualization.
The present invention targets the massive volume of comments in Internet forums. It designs and implements an automatic opinion leader mining system for Internet forums that can accurately mine the opinion leaders in a forum, providing technical support for the relevant Internet public opinion supervision departments to discover hot issues in time and guide the healthy development of online public opinion.
Description of drawings
Fig. 1 is a schematic diagram of the hardware platform structure of the automatic opinion leader mining system according to an embodiment of the invention;
Fig. 2 shows the crawler configuration interface according to an embodiment of the invention;
Fig. 3 is a flow chart of the steps for filtering spam comments with the spectral clustering method according to an embodiment of the invention;
Fig. 4 lists the basic metacharacters of regular expressions;
Fig. 5 shows the sentiment analysis word sets of HowNet.
Embodiment
As shown in Fig. 1, the automatic opinion leader mining system of one embodiment of the invention consists mainly of a data center and a computing center. The data center is a database server that stores the data crawled from Internet forums and provides data services to the computing center; the computing center processes the data provided by the data center with a series of algorithms to mine the opinion leaders.
The steps of the automatic opinion leader mining method of one embodiment of the invention are as follows:
1. Collecting forum data
(1) Capturing forum data with an open-source crawler
Forum data collection is the data foundation of opinion leader mining. The present invention uses the open-source crawler Netcrawler to collect forum data. Netcrawler is a front-end system for Web crawling that roams a collection of Web documents along their links. Its basic working principle is the same as that of a crawler performing breadth-first search (BFS) from seed URLs: starting from the given seed URLs, it reads the corresponding documents using standard protocols such as HTTP, takes all not-yet-visited URLs in those documents as new starting points, and continues roaming until no new URL satisfies the conditions. The main function of Netcrawler is to download all the web pages in a domain and then analyze and process all the relevant elements, including pictures, text, audio, and video. The latest version of Netcrawler can be downloaded at http://freecode.com/projects/netcrawler. With the forum home page URL, or the URL of a particular forum channel, as the seed URL, the crawler obtains the source code of the web pages containing the thread and reply pages in that channel; the actual parsing of reply metadata is carried out during web page preprocessing. The present invention keeps the system kernel, configures the elements to be collected from the forum (including title, body text, poster, posting time, IP address, click count, and reply count), and changes the system interface to make it more user-friendly. The crawler captures the web pages and related posts from the forum seed URLs provided by the user and stores them locally, providing raw data for appending to and updating the system's data.
The crawler configuration interface is shown in Fig. 2; it sets the seed URL, the number of threads, the save location, the crawl depth, the crawl type, and the keyword expression.
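The breadth-first roaming described above can be sketched in a few lines of Python. This is not Netcrawler's code: the `fetch_links` callback, the toy site, and its URLs are hypothetical stand-ins for real HTTP fetching and link parsing, and `max_depth` plays the role of the crawl depth setting.

```python
from collections import deque

def bfs_crawl(seed, fetch_links, max_depth=2):
    """Breadth-first traversal from a seed URL; returns URLs in visit order."""
    visited = [seed]
    seen = {seed}
    frontier = deque([(seed, 0)])
    while frontier:
        url, depth = frontier.popleft()
        if depth == max_depth:
            continue  # do not expand pages at the depth limit
        for link in fetch_links(url):
            if link not in seen:  # only URLs that were not visited before
                seen.add(link)
                visited.append(link)
                frontier.append((link, depth + 1))
    return visited

# Toy in-memory "forum" standing in for real pages (hypothetical URLs).
TOY_SITE = {
    "forum/index": ["forum/t1", "forum/t2"],
    "forum/t1": ["forum/t1?page=2", "forum/index"],
    "forum/t2": [],
    "forum/t1?page=2": ["forum/t3"],
}

order = bfs_crawl("forum/index", lambda u: TOY_SITE.get(u, []), max_depth=2)
```

With a depth limit of 2, the page `forum/t3` is never reached because it is only linked from a page that already sits at the limit.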
(2) Using message-oriented middleware to improve the real-time performance of data processing
Message-oriented middleware (MOM) is a specific kind of middleware. It uses an efficient and reliable message passing mechanism to exchange data in a platform-independent way and integrates distributed systems on the basis of data communication. Its most important function is to provide a timely and reliable means of message communication. To pass messages reliably, messages are usually managed with queues: when data is transmitted, it is split into message units of a user-defined size and placed in a message queue, and the middleware sends or receives messages in synchronous or asynchronous mode. In practice, techniques such as message priority, resumable transmission, reliable message queues, and in-memory queues are often used to guarantee delivery, and some products also add functions such as flow control and pre-established connections. The essence of message-oriented middleware is message passing, a technology that supports high-speed, asynchronous, reliable program-to-program communication.
This embodiment uses the message-oriented middleware ActiveMQ. ActiveMQ is open-source software released under the Apache 2.0 license that implements the Java Message Service (JMS) open standard; it is a standards-based messaging solution. The latest ActiveMQ can be downloaded at http://activemq.apache.org/download.html. While crawling web pages, the system sends each page's URL, crawl time, and local storage path as a message to an ActiveMQ message queue. Through ActiveMQ's message passing mechanism, the web page information in the queue is passed to the web page preprocessing stage in real time, so that newly fetched pages are noticed promptly and can be processed. In addition, ActiveMQ provides the interoperability, security, scalability, availability, manageability, and other functions required for system deployment.
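The hand-off between crawling and preprocessing can be illustrated with an in-process queue. The sketch below uses Python's standard `queue` and `threading` modules as a stand-in for an ActiveMQ queue; it does not use ActiveMQ or JMS, and the message fields, paths, and function names are hypothetical.

```python
import queue
import threading

page_queue = queue.Queue()   # stand-in for the message queue
processed = []               # paths handled by the preprocessing stage

def crawler(pages):
    """Producer: publish one message per fetched page, then a stop marker."""
    for url, fetched_at, path in pages:
        page_queue.put({"url": url, "fetched_at": fetched_at, "path": path})
    page_queue.put(None)     # sentinel telling the consumer to stop

def preprocessor():
    """Consumer: handle each page message as soon as it arrives."""
    while True:
        msg = page_queue.get()
        if msg is None:
            break
        processed.append(msg["path"])

pages = [
    ("forum/t1", "2012-07-01 10:00", "/data/t1.html"),
    ("forum/t2", "2012-07-01 10:01", "/data/t2.html"),
]
consumer = threading.Thread(target=preprocessor)
consumer.start()
crawler(pages)
consumer.join()
```

Because the consumer blocks on the queue, each page is picked up as soon as the crawler publishes it, which is the decoupling the middleware is meant to provide.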
2. Extracting web page information, performing word segmentation with a Chinese word segmentation system, and filtering spam comments with spectral clustering
(1) Extracting web page information
After the original web pages have been captured, web page information extraction is needed to obtain the comments in the pages. HTML, as a semi-structured descriptive language, is concerned more with the visual presentation of content (font, size, color, position, and so on) and neglects the organizational structure of the content. Once records in a database are formatted into HTML pages, the structural information is lost from the description. For the convenience of users browsing, a content page usually contains multiple posts, each of which is rendered visually as a relatively independent information block. At the level of the page's HTML document structure, each information block corresponds to a relatively independent Document Object Model (DOM) subtree; all the information block subtrees sit under the same parent node and share the same internal structure. Comment pages generally hold their data in tables. A table is a composite HTML structure; the main HTML tags used are <TABLE>, <TH>, <TR>, and <TD>, which specify the alignment and layout of the table.
Messages are taken from the message queue. The extracted web page metadata is stored in a database table, and the extracted comment content is stored on the local hard disk as documents that map one-to-one to the records in the database table. The web page metadata comprises the URL, source site, local storage path, comment title, commenter ID, view count, reply count, and publication time. The local storage path serves as the primary identifier of a web page and associates the local comment document with its database record.
Because HTML is a markup language, useful information can be extracted with regular expressions. Regular expressions are used mainly for text-based searching and editing; they can perform data validation and text replacement, and can extract substrings from a string by pattern matching. Fig. 4 describes several commonly used regular expression metacharacters. The present invention uses the regular expressions of the C# language on the Visual Studio 2008 platform to extract useful information from the web page source code.
The text information in a web page, including the page title, author, post content, replier, and reply content, is extracted by regular expression matching. The extracted text information is stored in an Oracle database; the database version used by the present invention is Oracle 10g.
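As an illustration of table-based extraction, the following sketch pulls author and content pairs out of a small HTML fragment. It uses Python's `re` module rather than the C# regular expressions named in the text, and the markup, class names, and pattern are hypothetical; a real forum page would need its own pattern.

```python
import re

# Hypothetical fragment of a reply page; real forum markup will differ.
html = """
<table>
<tr><td class="author">alice</td><td class="content">Good analysis.</td></tr>
<tr><td class="author">bob</td><td class="content">I disagree.</td></tr>
</table>
"""

# Non-greedy groups capture the text of the two cells in each table row.
ROW = re.compile(
    r'<td class="author">(.*?)</td>\s*<td class="content">(.*?)</td>',
    re.S,
)

posts = ROW.findall(html)  # list of (author, content) tuples
```

Each tuple can then be written to the database as one reply record.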
(2) Chinese word segmentation
Word segmentation is the key and the foundation of text mining. This embodiment uses the free version of the Hailiang ("massive") word segmentation development interface. This software is a widely used Chinese word segmentation system, the ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System) Chinese lexical analysis system based on a multilayer hidden Markov model developed by the Institute of Computing Technology of the Chinese Academy of Sciences; it offers not only high segmentation accuracy but also good efficiency. Its main functions include Chinese word segmentation, part-of-speech tagging, named entity recognition, and new word recognition; it also supports user dictionaries, Traditional Chinese, and multiple encodings such as GBK, UTF-8, UTF-7, and UNICODE. The Hailiang word segmentation system currently segments at 996 KB/s on a single machine with a segmentation precision of 98.45%; the API is no larger than 200 KB, and the various dictionary data are less than 3 MB after compression. ICTCLAS is written entirely in C/C++, supports Linux, FreeBSD, and the Windows family of operating systems, and supports mainstream development languages such as C/C++, C#, Delphi, and Java; it is currently among the better Chinese lexical analyzers. The development interface consists of two parts, the HLSSplit.dll program interface and the HLSSplit.dll.dat corpus, and can be obtained free of charge over the Internet.
This embodiment calls HLSSplit.dll of the Hailiang word segmentation interface and, based on the HLSSplit.dll.dat corpus, segments the text to obtain information such as the Chinese words, part-of-speech tags, word positions, and word frequency statistics. The segmentation results are used for spam comment filtering.
(3) Cluster-based spam comment filtering
Many of the replies obtained by matching are meaningless, such as small advertisements, irrelevant replies, and repeated replies. To mine opinion leaders effectively, the replies in which a replier says something meaningful to the poster should be analyzed, so as to obtain the replier's degree of support for the poster. Spam comments must therefore be filtered out.
The main steps of cluster-based spam comment filtering are shown in Fig. 3:
1) For the comment set obtained, represent the features of each comment with the vector space model, so that each comment is expressed as a vector in the space. The present invention represents text with the vector space model, in which a text d_i is regarded as an n-dimensional vector formed from a set of features (t_1, t_2, ..., t_n). The text d_i is thus reduced to a vector (w_i1, w_i2, ..., w_in) whose components are the feature weights, where the weight w_ij indicates the importance of feature t_j (j = 1, 2, ..., n) for classifying text d_i. The text d_i can be expressed as
d_i = (w_i1, w_i2, ..., w_in)
where t_j is a feature item, which may be a character, a word, or a phrase, and w_ij is the weight of feature item t_j, indicating the importance of t_j in the comment. The commonly used TF-IDF weighting method is adopted here.
2) Generate the similarity matrix. The similarity matrix contains the pairwise similarities of the comments. After every comment has been represented as a vector, the cosine formula is used to compute the similarity Sim(d_i, d_j) between two comments d_i and d_j. The calculation is as follows:
Similarity measure calculation
Input: two text vectors d_i and d_j
Output: similarity matrix G
The similarity between two documents can be expressed as the cosine of the angle between their vectors; that is, the similarity of documents d_i and d_j is
Sim(d_i, d_j) = (d_i · d_j) / (|d_i| * |d_j|)
3) Construct the unnormalized Laplacian matrix as the sample matrix. First derive the adjacency matrix W from the similarity matrix G; then sum the elements of each column of W to obtain N numbers and place them on the diagonal (zeros everywhere else) to form an N*N matrix, denoted D. Let L = D - W; L is the sample matrix.
4) Construct the eigenvector space. This requires the first k eigenvalues of L (here, unless otherwise stated, "the first k" refers to ascending order of eigenvalue) and the corresponding eigenvectors; these k eigenvectors form an N*k matrix, which is the eigenvector space.
5) Arrange these k (column) eigenvectors together to form an N*k matrix, regard each row as a vector in k-dimensional space, and cluster the rows using spectral clustering.
6) On the basis of the spectral clustering, compute the distance from each point to its corresponding class center. The Euclidean distance is used. The distance ρ(A, B) between two points A = (a[1], a[2], ..., a[n]) and B = (b[1], b[2], ..., b[n]) is defined as:
ρ(A, B) = sqrt[ ∑ (a[i] - b[i])^2 ]  (i = 1, 2, …, n)
7) Compute the outlier degree of each object. First take the ratio of the mean E(Xi) of the above distances to their variance E[(Xi - E(Xi))^2] as the basic data, then compute the outlier degree by the formula Out(i) = E(Xi) / E[(Xi - E(Xi))^2].
8) Detect spam comments using the outlier degree. In this spectral clustering based detection method, spam comments are treated as outliers and outlier detection is performed by Top-n ranking of the outlier degree; the objects with the highest outlier degree are the outliers, that is, the detected spam comments.
Once a comment has been determined to be spam, its corresponding record is deleted from the database.
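Steps 1) to 3) can be traced with a small pure-Python sketch: TF-IDF vectors, the cosine similarity matrix G, and the Laplacian L = D - W. Two points are assumptions rather than statements from the text: the TF-IDF variant (term frequency divided by comment length, natural-log IDF) and the choice of W as G with a zeroed diagonal. The tokenized comments are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns (vocabulary, TF-IDF vectors)."""
    vocab = sorted({t for d in docs for t in d})
    df = Counter(t for d in docs for t in set(d))  # document frequency
    n = len(docs)
    vectors = []
    for d in docs:
        tf = Counter(d)
        vectors.append([tf[t] / len(d) * math.log(n / df[t]) for t in vocab])
    return vocab, vectors

def cosine(u, v):
    """Sim(d_i, d_j) = (d_i . d_j) / (|d_i| * |d_j|); 0.0 for zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def similarity_matrix(vectors):
    return [[cosine(u, v) for v in vectors] for u in vectors]

def unnormalized_laplacian(G):
    """L = D - W, assuming W is G with self-similarities removed."""
    n = len(G)
    W = [[0.0 if i == j else G[i][j] for j in range(n)] for i in range(n)]
    degrees = [sum(W[i][j] for i in range(n)) for j in range(n)]  # column sums
    return [[(degrees[i] if i == j else 0.0) - W[i][j] for j in range(n)]
            for i in range(n)]

# Hypothetical segmented comments; the third shares no words with the others.
docs = [["good", "analysis"], ["good", "point"], ["buy", "cheap", "watches"]]
vocab, vectors = tfidf_vectors(docs)
G = similarity_matrix(vectors)
L = unnormalized_laplacian(G)
```

The eigen-decomposition and clustering of steps 4) and 5) are left out here; in practice they would be handled by a numerical library.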
3. Analyzing text orientation
When analyzing the relationships between users, beyond the basic connections, the core task is computing one user's degree of support for another. Classifying the sentiment orientation of post text improves the accuracy of the support analysis. Text orientation analysis means mining and analyzing subjective information in a text, such as viewpoints and preferences, to classify the text's emotional orientation.
There are many methods of text orientation analysis. The present invention performs Chinese word segmentation and spam comment filtering in the text preprocessing stage; given the characteristics of opinion leader mining, the precision required of the orientation classification is relatively low, while the efficiency required of the algorithm is high. The present invention performs orientation classification based on an emotional corpus. The corpus used is the sentiment analysis word set in HowNet201104, which is freely available on the Internet; its Chinese words are selected to build a polarity dictionary, and the sizes of the resulting word sets are shown in Fig. 5.
The forum emotional corpus thus established gives a quantified score for each sentiment word. For example, "good" scores 1, "very good" scores 2, "poor" scores -1, and "very poor" scores -2.
Once the corpus is established, every comment can be given an orientation score. Take the comment "The original poster's analysis is very good, but a bit extreme": "very good" and "extreme" both express the replier's emotional orientation; "very good" scores 2 and "extreme" scores -1, so the weighted sum gives the comment a score of 1. The score is then written to the "orientation score" field of the comment's record in the database.
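The scoring example above can be reproduced with a short sketch. The lexicon is a hypothetical English rendering of a few polarity entries, the comment is assumed to be already segmented into tokens, and all words are given unit weight, since the text does not state the weights used in the weighted sum.

```python
# Hypothetical polarity dictionary with the scores quoted in the text.
LEXICON = {"good": 1, "very good": 2, "poor": -1, "very poor": -2, "extreme": -1}

def comment_score(tokens, lexicon=LEXICON):
    """Sum the polarity scores of the sentiment words found in a comment.

    Assumption: every sentiment word has weight 1; other tokens score 0.
    """
    return sum(lexicon.get(token, 0) for token in tokens)

# Segmented form of the example comment (segmentation assumed, not computed).
tokens = ["original poster", "analysis", "very good", "but", "a bit", "extreme"]
score = comment_score(tokens)
```

The result, 2 + (-1) = 1, matches the score worked out in the text.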
4. Mining opinion leaders
The traditional PageRank algorithm:
PageRank relies on users' support for websites and uses the large-scale link structure to indicate the value of an individual page. It resembles a vote initiated by all the other pages on the Internet, which decides the importance of a page. A link pointing to a page represents one vote of support; a page with no links pointing to it receives no votes of support.
The PageRank value is defined as follows. Suppose pages T1 … Tn point to page A (that is, T1 … Tn cite page A). The parameter d is a damping factor set between 0 and 1. In addition, C(A) is defined as the number of links going out of A. The PageRank value of page A is then given by the following formula.
PR(A) = (1 - d) + d * (PR(T1)/C(T1) + … + PR(Tn)/C(Tn)),
where PR(A) is the score of the given page A; d is the damping factor, usually set to 0.8; PR(Tn) is the score of a page Tn that points to page A; C(Tn) is the number of outgoing links of page Tn; and PR(Tn)/C(Tn) is the share of Tn's score passed along each of its outgoing links. The PageRank values form a probability distribution over the whole set of pages, so the PageRank values of all pages sum to 1.
The score that an external link from page B brings to A is inversely proportional to B's number of outgoing links: as the number of outgoing links on B grows, the score passed to A falls. This also shows that a page's score is the basic measure of the vote the page casts for other pages. A page can vote for one or more outgoing links, but its total voting power is fixed and is divided equally among all its outgoing links. Suppose B's score is 5 and B has only one link, pointing to A: A then receives B's entire score, while B loses nothing. If B has n links, however, A receives only 1/n of B's score.
The improved PageRank algorithm:
As the definition of the traditional PageRank algorithm shows, it in effect assumes that, for every page Ti linking to page A, the probability of jumping from Ti to A along a link is the reciprocal of Ti's link out-degree; that is, Ti is assumed to jump to any linked page with equal probability. This does not match reality. In a forum, a user can have many outgoing links, supporting posts published by many people, but the weights of these links certainly differ and depend on the relevance between the replier and the poster. This weight can be described by the link relevance, expressed as follows:
L(A, Ti) = |Ua ∩ Ui| / |Ua ∪ Ui|,
where L(A, Ti) is the link relevance of page A and page Ti; Ua is the set of URLs consisting of page A's outgoing links, incoming links, and A itself; and Ui is the corresponding set for page Ti. In an Internet forum, this amounts to treating the original directed network as an undirected one. It expresses the ratio of the number of posts in which the poster and the replier reply to each other to the total number of posts published by the two; the larger the ratio, the more relevant the two are.
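The link relevance is the Jaccard coefficient of the two URL sets, which a few lines of Python make concrete. The set contents below are hypothetical placeholders for the URLs linked to, linked from, and belonging to each page.

```python
def link_relevance(urls_a, urls_b):
    """L(A, T) = |Ua ∩ Ut| / |Ua ∪ Ut| (Jaccard coefficient of two sets)."""
    union = urls_a | urls_b
    return len(urls_a & urls_b) / len(union) if union else 0.0

# Hypothetical sets: each holds a page's out-links, in-links, and itself.
Ua = {"A", "T1", "T2"}
U1 = {"T1", "A", "T3", "T4"}
rel = link_relevance(Ua, U1)  # 2 shared URLs out of 5 distinct ones
```

The value ranges from 0 (no shared neighbours) to 1 (identical neighbourhoods).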
This gives the revised formula:
PR(A) = (1 - d) + d * (PR(T1) * L(A, T1) + … + PR(Tn) * L(A, Tn)),
where PR(A) is the score of the given page A; d is the damping factor, set to 0.8; PR(Tn) is the score of a page Tn that points to page A; and L(A, Tn) is the link relevance of page A and page Tn.
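A minimal sketch of iterating the revised formula over forum users is given below, under several assumptions: users stand in for pages, a reply from one user to another counts as a link, every score starts at 1.0, and iteration stops when no score changes by more than the convergence error. The reply list and names are hypothetical. Note that under the formula as written every score is at least 1 - d, so any threshold applied to this toy output would have to be chosen for the toy data rather than taken from the 0.001 value quoted in the text.

```python
def jaccard(a, b):
    """Link relevance of two non-empty neighbour sets."""
    return len(a & b) / len(a | b)

def modified_pagerank(replies, d=0.8, tol=0.001, max_iter=200):
    """replies: list of (replier, poster) pairs. Returns {user: score}."""
    users = sorted({u for pair in replies for u in pair})
    supporters = {u: set() for u in users}  # users who replied to u
    nbrs = {u: {u} for u in users}          # in-links, out-links, and self
    for replier, poster in replies:
        if replier != poster:
            supporters[poster].add(replier)
            nbrs[poster].add(replier)
            nbrs[replier].add(poster)
    pr = {u: 1.0 for u in users}            # assumed starting scores
    for _ in range(max_iter):
        new = {a: (1 - d) + d * sum(pr[t] * jaccard(nbrs[a], nbrs[t])
                                    for t in supporters[a])
               for a in users}
        delta = max(abs(new[u] - pr[u]) for u in users)
        pr = new
        if delta < tol:                     # convergence error reached
            break
    return pr

# Hypothetical reply log: three users reply to alice, alice replies to bob.
replies = [("bob", "alice"), ("carol", "alice"), ("dave", "alice"), ("alice", "bob")]
scores = modified_pagerank(replies)
top_user = max(scores, key=scores.get)
```

The `max_iter` cap is a safeguard added for the sketch, since the relevance weights are not normalized and convergence is not guaranteed for arbitrary graphs.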
Notes on the improved PageRank algorithm:
Implementing the algorithm as a program yields a score for every user in the forum, and users with higher scores can be regarded as opinion leaders. In practice, however, the following issues should be noted:
(1) Setting the iteration convergence error
PageRank is an iterative, convergent algorithm, so a convergence error must be chosen. Choosing it takes some care: if it is set too large, the accuracy of the result is doubtful; if it is set too small, the computation takes too long and may not produce a result within the observation period. After repeated tests, the error is set to 0.001.
(2) Opinion leader selection standard
The standard referred to here is a numeric threshold: the score above which a user is considered an opinion leader. It is hard to give a single universal value. The Tianya Economic Forum that was studied has more than 20,000 users in total, and the scores of all users sum to 1. After repeated tests, 0.001 was found to be a reasonable threshold, and the opinion leaders selected with this value were all found, on analysis, to match the definition of an opinion leader.
(3) Handling the rank sink phenomenon
Several isolated nodes may vote for one another. For example, user A and user B each support the other, while neither receives support from any other user in the forum. During the computation their scores settle at a high value or even keep growing, so the final scores can exceed the selection standard and users A and B are mistaken for opinion leaders. This phenomenon is called rank sink. When rank sink occurs, the isolated nodes must be removed: in the implementation, if the user being read is one identified as causing rank sink, that user is skipped and does not enter the algorithm's matrix, which avoids rank sink.
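The two-user case of rank sink described above can be detected before the iteration starts. This is a minimal sketch under stated assumptions: it only handles pairs of users who support each other and receive no other support, and the names `find_rank_sink_pairs`, `drop_users`, and the `in_links` layout are hypothetical. Larger closed groups would need a more general check.

```python
def find_rank_sink_pairs(in_links):
    """Return users in two-node loops that get no support from anyone else."""
    sinks = set()
    for user, sources in in_links.items():
        supporters = set(sources) - {user}
        if len(supporters) != 1:
            continue
        (other,) = supporters
        # The only supporter must in turn be supported only by this user.
        if set(in_links.get(other, ())) - {other} == {user}:
            sinks.update((user, other))
    return sinks


def drop_users(in_links, excluded):
    """Remove excluded users so they never enter the score matrix."""
    return {
        user: [s for s in sources if s not in excluded]
        for user, sources in in_links.items()
        if user not in excluded
    }


# Hypothetical forum: A and B only support each other; C is supported by D, E and A.
in_links = {"A": ["B"], "B": ["A"], "C": ["D", "E", "A"], "D": ["C"], "E": ["C"]}
sinks = find_rank_sink_pairs(in_links)
filtered = drop_users(in_links, sinks)
```

Skipping the flagged users before the matrix is built corresponds to the "skip the user when reading" strategy described in note (3).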
Application of the improved PageRank algorithm:
Before the algorithm is run to mine opinion leaders, a matrix is constructed first. Suppose the forum has n users; an n × n matrix is then constructed in which each index corresponds to one forum user, and the value at each point (i, j) is the arithmetic sum of the sentiment-orientation scores of all replies made by user j to user i.
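The matrix construction above can be sketched as follows. The function name `build_support_matrix` and the `(replier, poster, score)` tuple layout are assumptions for illustration; the sentiment scores in the example are made-up values, not data from the patent.

```python
def build_support_matrix(users, replies):
    """Build the n x n matrix whose entry (i, j) sums user j's reply scores to user i.

    users   -- list of user names; position in the list is the matrix index
    replies -- iterable of (replier, poster, sentiment_score) tuples (hypothetical layout)
    """
    index = {user: k for k, user in enumerate(users)}
    n = len(users)
    matrix = [[0.0] * n for _ in range(n)]
    for replier, poster, score in replies:
        # Row = user being replied to (i), column = user who replied (j).
        matrix[index[poster]][index[replier]] += score
    return matrix


users = ["A", "B", "C"]
replies = [
    ("B", "A", 1.0),   # B replies to A with a supportive comment
    ("B", "A", -0.5),  # B replies to A again, this time negatively
    ("C", "A", 1.0),
    ("A", "B", 1.0),
]
matrix = build_support_matrix(users, replies)
```

Positive and negative replies from the same user partly cancel, so entry (i, j) reflects the net support that user j gives user i.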
Next, the iteration convergence error of the improved algorithm and the opinion leader selection standard are set; the present invention sets the standard value to 0.001. After the iterative computation, every user whose PR(A) value is greater than this standard value is an opinion leader of the forum.
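Selecting opinion leaders by the standard value can be sketched as below. The description states that all user scores sum to 1; this sketch assumes that is achieved by dividing each raw score by the total, which the source does not spell out. The function name `select_opinion_leaders` and the sample scores are hypothetical.

```python
def select_opinion_leaders(raw_scores, threshold=0.001):
    """Normalize scores to sum to 1 and return users above the threshold, best first."""
    total = sum(raw_scores.values())
    normalized = {user: score / total for user, score in raw_scores.items()}
    leaders = [user for user, score in normalized.items() if score > threshold]
    return sorted(leaders, key=lambda user: normalized[user], reverse=True)


# Made-up raw scores for three users.
raw_scores = {"alice": 30.0, "bob": 0.02, "carol": 69.98}
leaders = select_opinion_leaders(raw_scores)
```

With only a few users almost everyone exceeds 0.001; the threshold is meaningful for a forum of the size described, where the average share per user is far below it.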
Check whether any of the mined opinion leaders is a misjudgment caused by rank sink. If so, delete that user and re-run the improved PageRank algorithm until a correct result is obtained.
5. Visualizing the mining results
Visualization is implemented with the open-source social network visualization tool Vizster. Vizster is an interactive visualization tool for online social networks and can be downloaded from http://hci.stanford.edu/jheer/projects/vizster/download.html. The relations between users are organized as XML and used as the input to Vizster. This embodiment partially modifies Vizster: the information display area is simplified to show the names of users who reply to each other, the PageRank value, and whether the user is an opinion leader. In addition, because too many users in a forum would impair the display, only the mined opinion leaders and the relatively active users may be shown, so that Vizster achieves a concise, clear display without losing useful information.
Because Vizster supports visual search, analysis, and automatic association of nodes, the influence structure of all forum users can be viewed easily through the system, and opinion leaders can be found quickly by their color.
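Organizing user relations as XML can be sketched with Python's standard library. The element and attribute names below (`graph`, `node`, `edge`, `pagerank`, `leader`) are hypothetical and are not Vizster's actual input schema, which the description does not give; the sketch only shows how names, PageRank values, and the opinion-leader flag might be serialized together.

```python
import xml.etree.ElementTree as ET


def users_to_xml(scores, replies, threshold=0.001):
    """Serialize users and reply relations to an XML string (hypothetical schema)."""
    root = ET.Element("graph")
    for name, score in scores.items():
        ET.SubElement(
            root,
            "node",
            id=name,
            pagerank=f"{score:.4f}",
            leader=str(score > threshold).lower(),  # "true" or "false"
        )
    for replier, poster in replies:
        ET.SubElement(root, "edge", source=replier, target=poster)
    return ET.tostring(root, encoding="unicode")


# Made-up scores: A exceeds the 0.001 standard value, B does not.
xml_text = users_to_xml({"A": 0.0032, "B": 0.0004}, [("B", "A")])
```

A real integration would need to map these fields onto whatever format the modified Vizster build expects.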
Claims (7)
1. An Internet forum-oriented opinion leader mining method, involving an opinion leader mining system that comprises a computing center and a database server, the database server communicating with the computing center, characterized in that the steps of the method are:
1) capturing forum data with a crawler, and using message-oriented middleware to improve the real-time performance of data processing;
2) extracting web page information, performing word segmentation with a Chinese word segmentation system, and filtering spam comments with a spectral clustering method;
3) analyzing text sentiment orientation using an emotional corpus;
4) setting an opinion leader selection standard value, and determining opinion leaders using the following formula:
PR(A) = (1-d) + d * (PR(T1) * L(A,T1) + … + PR(Tn) * L(A,Tn)),
where PR(A) denotes the score of a given page A; d is the damping factor; PR(Tn) denotes the score of a page Tn that points to page A; L(A, Tn) denotes the link relevance between page A and page Tn, with L(A, Tn) = (Ua ∩ Un)/(Ua ∪ Un); Ua denotes the set of URLs made up of page A's out-links, its in-links, and page A itself; Un denotes the corresponding set of URLs for page Tn;
when PR(A) is greater than the set selection standard value, the user is determined to be an opinion leader;
5) visualizing the result of step 4).
2. The Internet forum-oriented opinion leader mining method according to claim 1, characterized in that in step 1) the open-source crawler Netcrawler is used to collect the forum data, and the message-oriented middleware is ActiveMQ.
3. The Internet forum-oriented opinion leader mining method according to claim 1, characterized in that in step 2) regular expressions are used to extract the web page information, and the Chinese word segmentation system is ICTCLAS, a Chinese lexical analysis system based on a multi-layer hidden Markov model.
4. The Internet forum-oriented opinion leader mining method according to claim 1, characterized in that in step 2) the steps for filtering spam comments with the spectral clustering method are:
1) collecting the text corpus and preprocessing the text to obtain a comment set;
2) for the comment set obtained, representing the features of every comment with a vector space model, so that every comment is expressed as a vector in the space;
3) generating the similarity matrix G;
4) constructing the unnormalized Laplacian matrix as the sample matrix: deriving the adjacency matrix W from the similarity matrix G, then summing the elements of each column of W to obtain N numbers, placing them on the diagonal with zeros everywhere else to form an N×N matrix, denoted D, and letting L = D - W, where L is the sample matrix;
5) constructing the eigenvector space: computing the first k eigenvalues of L and their corresponding eigenvectors, and forming an N×k matrix from these k eigenvectors, which is the eigenvector space, where the first k eigenvalues are taken in ascending order of eigenvalue;
6) arranging these k eigenvectors as columns to form an N×k matrix, regarding each row of the matrix as a vector in k-dimensional space, and clustering these vectors with the spectral clustering algorithm;
7) using Euclidean distance, computing the distance from each point to the center of its cluster on the basis of the spectral clustering result;
8) computing the outlier degree of each object: taking the ratio of the mean of the above distances, E(Xi), to their variance, E((Xi - E(Xi))²), as the basic data of the outlier degree, and then computing the outlier degree according to the formula Out(i) = E(Xi) / E((Xi - E(Xi))²);
9) detecting spam comments with the outlier degree: treating spam comments as outliers and performing outlier detection, which only requires a Top-n ranking of the outlier degrees; the objects with the highest outlier degree are the outliers, that is, the detected spam comments;
10) deleting the detected spam comments from the database.
5. The Internet forum-oriented opinion leader mining method according to claim 1, characterized in that in step 3) the emotional corpus used is the sentiment analysis word set in HowNet201104.
6. The Internet forum-oriented opinion leader mining method according to claim 1, characterized in that in step 4) the selection standard value is 0.001 and the damping factor d is set to 0.8.
7. The Internet forum-oriented opinion leader mining method according to claim 1, characterized in that in step 5) Vizster is used to implement the visualization.
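As an illustration of step 4) of claim 4, the unnormalized Laplacian L = D - W can be built directly from an adjacency matrix. This is a minimal sketch in plain Python; the function name `unnormalized_laplacian` and the 3×3 example matrix are hypothetical, and a practical implementation would more likely use a numerical library to also obtain the eigenvectors needed in step 5).

```python
def unnormalized_laplacian(w):
    """Return L = D - W, where D is diagonal and holds the column sums of W."""
    n = len(w)
    # Sum each column of the adjacency matrix to get the N diagonal entries of D.
    degrees = [sum(w[row][col] for row in range(n)) for col in range(n)]
    return [
        [(degrees[i] if i == j else 0.0) - w[i][j] for j in range(n)]
        for i in range(n)
    ]


# Made-up symmetric adjacency matrix for three comments.
w = [
    [0.0, 1.0, 0.5],
    [1.0, 0.0, 0.0],
    [0.5, 0.0, 0.0],
]
laplacian = unnormalized_laplacian(w)
```

For a symmetric W, every row of the resulting matrix sums to zero, which is a quick sanity check on the construction.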
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2012102501161A CN102890702A (en) | 2012-07-19 | 2012-07-19 | Internet forum-oriented opinion leader mining method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2012102501161A CN102890702A (en) | 2012-07-19 | 2012-07-19 | Internet forum-oriented opinion leader mining method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN102890702A true CN102890702A (en) | 2013-01-23 |
Family
ID=47534204
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2012102501161A Pending CN102890702A (en) | 2012-07-19 | 2012-07-19 | Internet forum-oriented opinion leader mining method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102890702A (en) |
Cited By (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103150333A (en) * | 2013-01-26 | 2013-06-12 | 安徽博约信息科技有限责任公司 | Opinion leader identification method in microblog media |
CN103279484A (en) * | 2013-04-23 | 2013-09-04 | 中国科学院计算技术研究所 | Creating method and creating system facing future opinion leaders in micro-blog system |
CN103646097A (en) * | 2013-12-18 | 2014-03-19 | 北京理工大学 | Constraint relationship based opinion objective and emotion word united clustering method |
CN104142948A (en) * | 2013-05-09 | 2014-11-12 | 富士通株式会社 | Method and equipment for mining domain review leader |
CN104239373A (en) * | 2013-06-24 | 2014-12-24 | 腾讯科技(深圳)有限公司 | Document tag adding method and document tag adding device |
CN104462253A (en) * | 2014-11-20 | 2015-03-25 | 武汉数为科技有限公司 | Topic detection or tracking method for network text big data |
CN104750699A (en) * | 2013-12-25 | 2015-07-01 | 伊姆西公司 | Comment data management method and advice |
CN104866572A (en) * | 2015-05-22 | 2015-08-26 | 齐鲁工业大学 | Method for clustering network-based short texts |
CN105630801A (en) * | 2014-10-30 | 2016-06-01 | 国际商业机器公司 | Method and apparatus for detecting deviated user |
CN106354843A (en) * | 2016-08-31 | 2017-01-25 | 虎扑(上海)文化传播股份有限公司 | Web crawler system and method |
CN107145897A (en) * | 2017-03-14 | 2017-09-08 | 中国科学院计算技术研究所 | A kind of differentiation network specific group's method for digging and system based on communication space-time characteristic |
CN107391775A (en) * | 2017-08-28 | 2017-11-24 | 湖北省楚天云有限公司 | A kind of general web crawlers model implementation method and system |
CN107633260A (en) * | 2017-08-23 | 2018-01-26 | 上海师范大学 | A kind of social network opinion leader method for digging based on cluster |
CN108009726A (en) * | 2017-12-04 | 2018-05-08 | 上海财经大学 | A kind of things evaluation system of combination user comment |
CN108009727A (en) * | 2017-12-04 | 2018-05-08 | 上海财经大学 | A kind of things evaluation method of combination user comment |
CN109815395A (en) * | 2018-12-26 | 2019-05-28 | 北京中科闻歌科技股份有限公司 | Webpage garbage information filtering method, device and storage medium |
CN110110084A (en) * | 2019-04-23 | 2019-08-09 | 北京科技大学 | The recognition methods of high quality user-generated content |
CN110489658A (en) * | 2019-07-12 | 2019-11-22 | 北京邮电大学 | Online social network opinion leader method for digging based on digraph model |
CN111460317A (en) * | 2020-03-30 | 2020-07-28 | 北京百分点信息科技有限公司 | Opinion leader identification method, device and equipment |
CN111831881A (en) * | 2020-07-04 | 2020-10-27 | 西安交通大学 | Malicious crawler detection method based on website traffic log data and optimized spectral clustering algorithm |
CN112116473A (en) * | 2020-09-18 | 2020-12-22 | 上海计算机软件技术开发中心 | Cross-chain notary mechanism evaluation system and platform |
CN114443902A (en) * | 2022-02-22 | 2022-05-06 | 广州云智达创科技有限公司 | Person-to-person analysis method, person-to-person analysis device, storage medium and program product |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040073476A1 (en) * | 2002-10-10 | 2004-04-15 | Prolink Services Llc | Method and system for identifying key opinion leaders |
-
2012
- 2012-07-19 CN CN2012102501161A patent/CN102890702A/en active Pending
Non-Patent Citations (3)
Title |
---|
吴令飞: "寻找"意见领袖":应用Page Rank算法处理社会网络数据的尝试", 《北京大学硕士学位论文》, 31 December 2009 (2009-12-31) * |
葛斌等: "网络论坛意见领袖挖掘系统设计与实现", 《电脑知识与技术》, vol. 7, no. 22, 31 August 2011 (2011-08-31), pages 5393 - 5395 * |
钟洵: "谱聚类在离群数据挖掘中的应用", 《中国优秀硕士学位论文全文数据库信息科技辑》, no. 3, 15 March 2011 (2011-03-15) * |
Cited By (34)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103150333B (en) * | 2013-01-26 | 2016-01-13 | 安徽博约信息科技有限责任公司 | Opinion leader identification method in microblog media |
CN103150333A (en) * | 2013-01-26 | 2013-06-12 | 安徽博约信息科技有限责任公司 | Opinion leader identification method in microblog media |
CN103279484A (en) * | 2013-04-23 | 2013-09-04 | 中国科学院计算技术研究所 | Creating method and creating system facing future opinion leaders in micro-blog system |
CN103279484B (en) * | 2013-04-23 | 2016-03-30 | 中国科学院计算技术研究所 | The creation method of a kind of following leader of opinion in micro blog system and system |
CN104142948A (en) * | 2013-05-09 | 2014-11-12 | 富士通株式会社 | Method and equipment for mining domain review leader |
CN104239373A (en) * | 2013-06-24 | 2014-12-24 | 腾讯科技(深圳)有限公司 | Document tag adding method and document tag adding device |
CN103646097B (en) * | 2013-12-18 | 2016-09-07 | 北京理工大学 | A kind of suggestion target based on restriction relation and emotion word associating clustering method |
CN103646097A (en) * | 2013-12-18 | 2014-03-19 | 北京理工大学 | Constraint relationship based opinion objective and emotion word united clustering method |
CN104750699A (en) * | 2013-12-25 | 2015-07-01 | 伊姆西公司 | Comment data management method and advice |
US10614089B2 (en) | 2013-12-25 | 2020-04-07 | EMC IP Holding Company LLC | Managing opinion data |
CN104750699B (en) * | 2013-12-25 | 2019-05-03 | 伊姆西公司 | Method and apparatus for managing opinion data |
CN105630801A (en) * | 2014-10-30 | 2016-06-01 | 国际商业机器公司 | Method and apparatus for detecting deviated user |
CN104462253A (en) * | 2014-11-20 | 2015-03-25 | 武汉数为科技有限公司 | Topic detection or tracking method for network text big data |
CN104462253B (en) * | 2014-11-20 | 2018-05-18 | 武汉数为科技有限公司 | A kind of topic detection or tracking of network-oriented text big data |
CN104866572A (en) * | 2015-05-22 | 2015-08-26 | 齐鲁工业大学 | Method for clustering network-based short texts |
CN104866572B (en) * | 2015-05-22 | 2018-05-18 | 齐鲁工业大学 | A kind of network short text clustering method |
CN106354843A (en) * | 2016-08-31 | 2017-01-25 | 虎扑(上海)文化传播股份有限公司 | Web crawler system and method |
CN107145897B (en) * | 2017-03-14 | 2020-01-07 | 中国科学院计算技术研究所 | Evolution network special group mining method and system based on communication space-time characteristics |
CN107145897A (en) * | 2017-03-14 | 2017-09-08 | 中国科学院计算技术研究所 | A kind of differentiation network specific group's method for digging and system based on communication space-time characteristic |
CN107633260B (en) * | 2017-08-23 | 2020-10-16 | 上海师范大学 | Social network opinion leader mining method based on clustering |
CN107633260A (en) * | 2017-08-23 | 2018-01-26 | 上海师范大学 | A kind of social network opinion leader method for digging based on cluster |
CN107391775A (en) * | 2017-08-28 | 2017-11-24 | 湖北省楚天云有限公司 | A kind of general web crawlers model implementation method and system |
CN108009726A (en) * | 2017-12-04 | 2018-05-08 | 上海财经大学 | A kind of things evaluation system of combination user comment |
CN108009727A (en) * | 2017-12-04 | 2018-05-08 | 上海财经大学 | A kind of things evaluation method of combination user comment |
CN108009726B (en) * | 2017-12-04 | 2021-12-28 | 上海财经大学 | Object evaluation system combining user comments |
CN109815395A (en) * | 2018-12-26 | 2019-05-28 | 北京中科闻歌科技股份有限公司 | Webpage garbage information filtering method, device and storage medium |
CN109815395B (en) * | 2018-12-26 | 2021-06-08 | 北京中科闻歌科技股份有限公司 | Webpage spam filtering method and device and storage medium |
CN110110084A (en) * | 2019-04-23 | 2019-08-09 | 北京科技大学 | The recognition methods of high quality user-generated content |
CN110489658A (en) * | 2019-07-12 | 2019-11-22 | 北京邮电大学 | Online social network opinion leader method for digging based on digraph model |
CN111460317A (en) * | 2020-03-30 | 2020-07-28 | 北京百分点信息科技有限公司 | Opinion leader identification method, device and equipment |
CN111831881A (en) * | 2020-07-04 | 2020-10-27 | 西安交通大学 | Malicious crawler detection method based on website traffic log data and optimized spectral clustering algorithm |
CN111831881B (en) * | 2020-07-04 | 2023-03-21 | 西安交通大学 | Malicious crawler detection method based on website traffic log data and optimized spectral clustering algorithm |
CN112116473A (en) * | 2020-09-18 | 2020-12-22 | 上海计算机软件技术开发中心 | Cross-chain notary mechanism evaluation system and platform |
CN114443902A (en) * | 2022-02-22 | 2022-05-06 | 广州云智达创科技有限公司 | Person-to-person analysis method, person-to-person analysis device, storage medium and program product |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102890702A (en) | Internet forum-oriented opinion leader mining method | |
CN103324665B (en) | Hot spot information extraction method and device based on micro-blog | |
CN104899273B (en) | A kind of Web Personalization method based on topic and relative entropy | |
US20170004128A1 (en) | Device and method for analyzing reputation for objects by data mining | |
CN101593200B (en) | Method for classifying Chinese webpages based on keyword frequency analysis | |
CN109614550A (en) | Public sentiment monitoring method, device, computer equipment and storage medium | |
US8812505B2 (en) | Method for recommending best information in real time by appropriately obtaining gist of web page and user's preference | |
CN103023714B (en) | The liveness of topic Network Based and cluster topology analytical system and method | |
CN107577759A (en) | User comment auto recommending method | |
CN104050163A (en) | Content recommendation system and method | |
CN103914478A (en) | Webpage training method and system and webpage prediction method and system | |
CN103324666A (en) | Topic tracing method and device based on micro-blog data | |
CN102119385A (en) | Method and subsystem for searching media content within a content-search-service system | |
TW200925970A (en) | Customized today module | |
CN111192176B (en) | Online data acquisition method and device supporting informatization assessment of education | |
CN102955848A (en) | Semantic-based three-dimensional model retrieval system and method | |
Geçkil et al. | A clickbait detection method on news sites | |
CN110188191A (en) | A kind of entity relationship map construction method and system for Web Community's text | |
CN104199938B (en) | Agricultural land method for sending information and system based on RSS | |
CN102402566A (en) | Web user behavior analysis method based on Chinese webpage automatic classification technology | |
CN110134845A (en) | Project public sentiment monitoring method, device, computer equipment and storage medium | |
CN106503256B (en) | A kind of hot information method for digging based on social networks document | |
CN104050243B (en) | It is a kind of to search for the network search method combined with social activity and its system | |
Zhang et al. | An approach of service discovery based on service goal clustering | |
CN116010552A (en) | Engineering cost data analysis system and method based on keyword word library |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20130123 |