The Framework of Network Public Opinion Monitoring and Analyzing System Based on Semantic Content Identification




Cheng Xian-Yi 1,2, Zhu Ling-ling 1, Zhu Qian 2, Wang Jin 1
1. School of Computer Science and Technology, Nantong University, Nantong 226019, China
2. School of Computer Science and Communications Engineering, Jiangsu University, Zhenjiang 212013, China
E-mail: xycheng@ntu.edu.cn
doi:10.4156/jcit.vol5.issue10.7
Journal of Convergence Information Technology, Volume 5, Number 10, December 2010

Abstract
The network has become an important platform for the public to express opinions, discuss public affairs, and participate in economic, social and political life. Because network public opinion spreads at a geometric rate, the government needs to monitor and analyze it in order to manage public opinion, discover hot spots in time, and guide opinion trends correctly. Network public opinion monitoring and analysis has therefore become a hot issue in recent years. The main mature technology today is keyword-based statistical analysis, but there is still much room to improve its effectiveness. This paper describes a framework for a network public opinion monitoring and analyzing system based on semantic content identification, intended to solve some key problems of public opinion analysis.

Keywords: Network Public Opinion, Natural Language Processing, Semantic

1. Introduction
In recent years, with the development of the Internet and the increasing number of netizens, some people have disclosed and spread sensitive and bad information through forums, IM, e-mail and so on, which threatens social stability and people's lives and property. On the one hand, national legislation and regulations should pay closer attention to the focal points of public opinion in order to serve the public better; on the other hand, the government bears a vital responsibility for correctly monitoring sensitive public opinion and guiding it, which protects network users from bad information and helps build a harmonious socialist country. According to preliminary statistics from the Internet center, there have been large direct and indirect losses since 1996, as can be seen in Figure 1. Therefore, network public opinion monitoring and analysis has become an urgent and important issue [1].

Figure 1. Statistics of the harm to society caused by the spread of network public opinion (Y axis: state loss, in millions of dollars)

The most important technologies for network public opinion analysis include text filtering, text classification, clustering, viewpoint tendentiousness recognition, topic tracking, automatic summarization and so on, which have long been studied by researchers at home and abroad. To make control more effective, this paper describes a framework for a network public opinion monitoring and analyzing system based on semantic content identification.

2. Research Situation
Researchers from DARPA, CMU, the University of Massachusetts and Dragon Systems, Inc. began to define the topic detection and tracking study and developed TDT. The important technology of this project is content classification of information, which resolves the contradiction between processing speed and safety monitoring in real-time monitoring and makes it feasible. There are related studies abroad, such as the PICS of the W3C, which has become a classification standard on the WWW. There are two general international classification standards, RSACi and SafeSurf, both of which conform to PICS. On the one hand, the classification technique is used for web page classification and filtering; on the other hand, foreign policies and standards are not fully suitable for China's national conditions for various reasons.

In China, the Founder ZhiSi public opinion warning DSS [2], designed by the Founder Technology Research Institute, has been successful. The system achieves automatic real-time monitoring and analysis of massive public opinion data. It is more effective for government monitoring of public opinion than the traditional manual mode. It also helps to strengthen the supervision of the Internet and plays a certain role in handling sudden public events on the network. This DSS provides functions such as full-text retrieval, automatic classification, automatic clustering, subject detection and tracking, related recommendation and duplicate removal, association and tendency analysis, automatic summarization and keyword extraction, thunderbolt analysis, and statistics generation.
The Goonie network public opinion monitoring system combines Internet search technology, intelligent information processing technology and knowledge management methods. Through automatic collection, automatic classification and merging, subject collection and special-topic focusing, it supports network public opinion monitoring and special news tracking, and produces briefings, reports and so on. Goonie can therefore grasp public opinion, guide it properly and provide analysis reports [3].

A framework for a content security monitoring system based on human-computer combination was designed in [4]. The framework is hierarchical, with three levels: the data acquisition layer, the content analysis layer and the output layer. Its main function is to examine information based on its content through content analysis and to identify bad information; at the same time, it can provide electronic evidence of the misuse of information by recording the source and content of the information and tracking it through effective audit analysis.

Although many domestic organizations are engaged in research on Internet content filtering and are trying to purify the network environment, these techniques are still in their infancy, and there are still deficiencies in "semantic filtering".

3. System Framework
The purpose of the system is to produce monitoring reports on network public opinion in a large-scale network environment through information detection, acquisition, tracking of themes, hot topics and events, experimental monitoring and so on, and to present the analysis results in several forms, such as briefings, reports and charts. The system can therefore grasp public opinion, guide it properly and provide analysis reports. The functional block diagram of the modules of the monitoring system is shown in Figure 2. There are five stages: resource discovery, information selection, pattern discovery, information extraction, and public opinion handling.

[Figure 2 diagram. Labels, from bottom to top: resources discovery (Web, chat room, e-mail, blog, BBS); information selection on the server (digital collection, format shifting, data import/export, noise filtering, subject extraction, text purification, filtering, clustering, expression); pattern discovery (topic pattern, event pattern, tendentious analysis, semantic index); information extraction (topic search, event search, trend analysis, abstract report, text analysis audit; unstructured text to structured text); public opinion handling on the client (theme management, structured knowledge, warning, filter, counter, monitor, decision).]
Figure 2. Logical structure of the function modules of the network public opinion monitoring system

Figure 3 is the system workflow. The system includes the following five databases:
1) Public opinion planning database: collects the public opinion demands, covering online news, BBS, chat rooms, blogs, aggregated news (RSS), etc.
2) Public opinion analysis database: a structured database that stores the data obtained through classification and clustering, keyword extraction, duplicate removal and filtering, named entity recognition, semantic computing, etc.
3) Public opinion database: stores products related to public opinion, such as analysis reports, survey reports, experience summaries and related information.
4) Semantic dictionary: ontology knowledge, etc.
5) HNC knowledge: sentence knowledge (466 items), etc. [5].

[Figure 3 diagram. Labels: public sentiment, resources discovery, public opinion planning database, information selection, information extraction, public opinion analysis database, pattern discovery, public opinion handling, public opinion products, HNC knowledge, semantic dictionary.]
Figure 3. System workflow of network public opinion monitoring

Figure 4 is the client workflow.

[Figure 4 flowchart. Labels: user; show topic list; modify the theme? (yes: theme management; no: theme selection); server-side processing; information extraction; public opinion handling (topic retrieval, event retrieval, summary retrieval, public opinion report, warning, filter, counter, monitor, decision); continue or exit.]
Figure 4. Client workflow

Figure 5 is the data flow chart of the system. The modules interact in different ways. Data interaction between the resources discovery module and the information selection module is file-based. The information selection module converts text into vectors or an ontology. The pattern discovery module uses GATE to tag named entities, determines the relationships between the entities, and then discovers the event pattern or the topic pattern. The information extraction module mainly performs semantic computing and transforms the patterns into templates, which turns unstructured information into structured information. The public opinion handling module runs queries according to the user's request and presents the results to the user in a suitable form; at the same time, it receives the user's settings and query requests.

[Figure 5 diagram. Labels: network; resources discovery; information selection (analysis of unstructured text; vector, ontology); pattern discovery; information extraction (structured results); database; public opinion handling (user requests, results shown); server; client.]
Figure 5. Data flow chart of the system
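The data flow between the five modules can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the system's implementation: the names `Document`, `discover`, `select`, `discover_patterns`, `extract`, `handle` and `run_pipeline` are hypothetical, data is passed in memory rather than through files, and a capitalised-token rule stands in for GATE entity tagging.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Document:
    source: str                                        # e.g. "BBS" or "blog"
    text: str
    vector: Counter = field(default_factory=Counter)   # filled by select()
    entities: list = field(default_factory=list)       # filled by discover_patterns()


def discover(raw_items):
    """Resources discovery: wrap raw (source, text) pairs as documents."""
    return [Document(source, text) for source, text in raw_items]


def select(docs):
    """Information selection: drop empty texts, turn each text into a term-count vector."""
    kept = [d for d in docs if d.text.strip()]
    for d in kept:
        d.vector = Counter(d.text.lower().split())
    return kept


def discover_patterns(docs):
    """Pattern discovery: toy stand-in for GATE tagging (capitalised tokens as entities)."""
    for d in docs:
        d.entities = [w for w in d.text.split() if w[:1].isupper()]
    return docs


def extract(docs):
    """Information extraction: fill a flat template so text becomes a structured record."""
    return [
        {"source": d.source, "entities": d.entities, "length": sum(d.vector.values())}
        for d in docs
    ]


def handle(records, entity):
    """Public opinion handling: answer a user query over the structured records."""
    return [r for r in records if entity in r["entities"]]


def run_pipeline(raw_items, entity):
    """Chain the five stages in the order described in the text."""
    return handle(extract(discover_patterns(select(discover(raw_items)))), entity)
```

Keeping each stage as a function that takes and returns plain data mirrors the description of modules that only exchange files, vectors, patterns and structured results, so any one stage could be replaced (for example by a real GATE call) without touching the others.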

Figure 6 is the network topology of the entire system. The system can serve many users; each user can connect to a server. Servers can share and exchange data with each other over the network, and the network connection can be P2P or client-server. The topology will be continually modified and optimized in the future.

[Figure 6 diagram. Labels: network media, administrative office, extranet, network communication, law enforcement, special content management.]
Figure 6. Network topology

4. Working Process

4.1. Resources discovery based on latent semantic analysis
Resources discovery, which retrieves the necessary network resources, is a process that integrates, consolidates and maps data from different network patterns. Different resources require different retrieval tools and strategies. Texts from BBS, chat rooms and e-mail are short and random. First, DTS is used to import and export the documents. Then theme-based latent semantic analysis is used to overcome the problems of algorithms that ignore context and misjudge synonyms, with SVD used for filtering and noise removal. Based on document similarity calculation and clustering analysis of the content, topic drift can be found effectively and in time, which better meets the requirements of public surveillance.

4.2. Information selection
Information selection obtains specialized information from the network by automatic selection and pre-processing. First, noise is filtered out, named entities are recognized, and the subject and the event are extracted. Secondly, the text is classified, clustered and filtered according to topics or events. Finally, the text is discriminated.

1) Text classification based on semi-supervised learning
A distinctive feature of public opinion text is that it is short and has to be handled in massive volumes. Traditional text classification is supervised learning: it learns from samples labelled with predefined category tags and then determines the category of a text according to its semantic content. It needs a large number of labelled samples to train a good classifier. Unlabelled data is easy to obtain in large quantities, but labelled data is costly and impractical to obtain, which creates a bottleneck when traditional text classification deals with huge amounts of data. We use text classification based on semi-supervised learning to overcome the sparsity of short texts and to improve the accuracy of short text classification. To increase the robustness of the algorithm and to better avoid falling into a local optimum, the Bagging algorithm is integrated into the semi-supervised learning.

2) Bad information detection

Bad information detection is one of the key factors of a monitoring system for website content. Traditional network detection systems recognize and filter information based only on keywords. If a number of cult sites are to be masked, sites that criticize the cult will also be filtered out. Therefore, we put forward a method for detecting bad content based on HNC (Figure 7), which does not match keywords but judges whether a text needs to be filtered according to the meaning of its sentences.

4.3. Pattern Discovery
Pattern discovery performs hot topic detection, tracking of incidents of concern, and orientation analysis through data mining and semantic computing on the data from the selection module. This module is the core of the system. Pattern discovery proceeds as follows:

1) First, word segmentation and POS tagging are performed with the ICTCLAS software developed at the Chinese Academy of Sciences, and four tables are obtained:
Theme table (ID, title, text, author, time, vector)
Comment table (ID, theme ID, title, text, author, time, tendentious value)
Topic table (ID, keyword group, number of participants, time, polarity, viewpoint opposition, notes)
Topic-theme map table (topic ID, theme ID)
A theme ID is assigned incrementally when a theme is inserted into the database, and this theme ID is used to keep the correspondence between comments and their theme when comments are saved. The third table holds the basic clustering information; the fourth table records which themes each cluster contains, that is, the themes that belong to each topic.
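The four tables and the theme ID bookkeeping can be pictured with a small in-memory sketch. The class and method names (`OpinionStore`, `add_theme`, `add_comment`, `add_topic`, `comments_for_topic`) are assumptions made for illustration; the paper does not specify column types or a database engine, and the comment table is assumed here to reference its theme by theme ID.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Theme:                      # Theme table
    id: int
    title: str
    text: str
    author: str
    time: str
    vector: list = field(default_factory=list)


@dataclass
class Comment:                    # Comment table
    id: int
    theme_id: int                 # assumed link back to Theme.id
    title: str
    text: str
    author: str
    time: str
    tendentious_value: float = 0.0


@dataclass
class Topic:                      # Topic table: one row per cluster
    id: int
    keywords: list
    participants: int = 0
    time: str = ""
    polarity: float = 0.0
    viewpoint_opposition: float = 0.0
    notes: str = ""


class OpinionStore:
    """Hypothetical in-memory stand-in for the database."""

    def __init__(self):
        self._theme_ids = count(1)    # theme IDs are handed out incrementally
        self._comment_ids = count(1)
        self.themes = {}
        self.comments = []
        self.topics = {}
        self.topic_theme = set()      # topic-theme map table: (topic ID, theme ID)

    def add_theme(self, title, text, author, time):
        theme = Theme(next(self._theme_ids), title, text, author, time)
        self.themes[theme.id] = theme
        return theme.id

    def add_comment(self, theme_id, title, text, author, time):
        if theme_id not in self.themes:
            raise KeyError(f"unknown theme ID {theme_id}")
        comment = Comment(next(self._comment_ids), theme_id, title, text, author, time)
        self.comments.append(comment)
        return comment.id

    def add_topic(self, topic_id, keywords, theme_ids):
        self.topics[topic_id] = Topic(topic_id, list(keywords))
        for theme_id in theme_ids:
            self.topic_theme.add((topic_id, theme_id))

    def comments_for_topic(self, topic_id):
        """Follow topic -> themes -> comments through the map table."""
        theme_ids = {t for (tp, t) in self.topic_theme if tp == topic_id}
        return [c for c in self.comments if c.theme_id in theme_ids]
```

Going from a topic to its comments takes two hops (map table, then theme ID), which is the same lookup needed later when comments are queried per topic for hot spot ranking.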
[Figure 7 diagram. Labels: article pre-treatment; sentence analysis (sentence semantic structure, elements of the framework, text short-term memory); context generation; semantic position judgment; HNC concept knowledge base and word knowledge base; HNC judgment semantic library; network map object position; red and black check; red and black objects library. Text nature: 1 absolutely black, 2 absolutely red, 3 black, 4 suspicious III, 5 suspicious II, 6 suspicious I, 7 neutral.]
Figure 7. Bad information detection algorithm diagram

2) Tendency analysis
First, a tendency dictionary is prepared. An initial dictionary is built from the words in HowNet that are marked with polarity and strength by manual labelling, and some common words are then added manually. Because tendency must be looked up quickly, the tendency dictionary should be implemented with the hash table provided by the Java language.
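A hash-based tendency dictionary of this kind might look as follows. The paper names Java's hash table; this sketch uses a Python `dict`, which is also a hash table, and the entry layout, the function names and the sample words are illustrative assumptions rather than HowNet data.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    polarity: int      # +1 for positive words, -1 for negative words
    strength: float    # assumed scale: 0.0 (weak) to 1.0 (strong)


NEUTRAL = Entry(0, 0.0)    # returned for words that are not in the dictionary


def build_tendency_dictionary(seed, manual):
    """Merge labelled seed words with manually added words into one hash map.

    Both arguments map word -> (polarity, strength); manual entries win on conflict.
    """
    dictionary = {}
    for word, (polarity, strength) in {**seed, **manual}.items():
        if polarity not in (-1, 1):
            raise ValueError(f"polarity of {word!r} must be +1 or -1")
        dictionary[word.lower()] = Entry(polarity, strength)
    return dictionary


def lookup(dictionary, word):
    """Average-case constant-time lookup; unknown words count as neutral."""
    return dictionary.get(word.lower(), NEUTRAL)
```

A possible use, with made-up sample words:

```python
seed = {"excellent": (1, 0.9), "terrible": (-1, 0.9)}
manual = {"ok": (1, 0.2), "terrible": (-1, 1.0)}
tendency = build_tendency_dictionary(seed, manual)
entry = lookup(tendency, "Terrible")   # manual entry overrides the seed entry
```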

Next, the text is read and processed sentence by sentence. Stop words are removed from each sentence, the tendency dictionary is queried word by word, and the polarity and strength of each word are calculated in its context. Then all the polarity components are added up, and the tendency density is obtained by dividing the sum by the square root of the number of comments. Finally, the tendency value is used to divide the comments into tendency classes and ranks according to its distribution.

3) Key point analysis of public opinion
The comments are queried from the database according to the topic-theme map table and ranked as the basis for identifying hot spots. The degree of concern about an incident is calculated from the opposition of views on the topic combined with its comments. An initial point is selected based on a basic accumulation unit of time (for example, days). The opposition of views at each time point is then calculated by counting only the comments made before that point in time. The increase in opposition for a period is obtained by subtracting the value at the previous time point from the value at the current time point. In this way the trend of events can be obtained.

4.4. Information extraction
This module mainly obtains structured data and builds several databases for analysis, and confirms or explains the patterns that have been mined. GATE [6] can be used for entity recognition, entity-relationship recognition, event recognition, summary generation, etc.

4.5. Public opinion handling
1) Warning
The public opinion warning module collects network information, discovers problems and gives feedback. Warning is active: for a given time period it shows the events related to the theme and the trend of the topic.

2) Filtering
Filtering deals only with bad information. Network management removes negative news by monitoring at all times. Sensitive phrases are collected from different fields, a weight is set for each phrase, and intelligent software finds sensitive phrase matches according to the weights. Information whose score exceeds an established threshold is shielded.

3) Counter
First, the IP address of the source is obtained and then locked. Effective attack methods (for example penetration techniques, virus techniques, advanced hacker attack techniques and so on) can be used to carry out targeted attacks on hub websites that disseminate unsafe information. This prevents the unsafe information from spreading and counters it.

4) Monitoring
After the start time of monitoring is entered, the system lists all events or topics about the subject. The user selects the suspected event or topic, and the monitoring module monitors it continuously. Monitoring differs from early warning in that monitoring is passive surveillance, while early warning is active.

5) Decision
Complete decision-making is often not possible in a single step; it is an iterative process. In this process, human-computer interaction lets policy makers vary the parameters of different options and alternatives.

5. Conclusions
Traditional machine learning methods carry a heavy workload, because netizens' comments must be tagged manually to train the classifiers. This paper applies semantics-based content identification technology to design a framework for a network public opinion analysis and monitoring system, suited to comments that are relatively short and use a broad emotional vocabulary. As a next step, we will carry out experiments to show that the system can achieve satisfactory results.

6. References
[1] Li Yonghao. Simulation and analysis of Rapid screening algorithms about network hot topics. Computer Communication Laboratory, Beijing Jiaotong University, internal communication documents. 2006. 14-16
[2] Founder Technology Research Institute. Public opinion on science and technology means to support network monitoring and analysis of unexpected events - Founder ZhiSi public opinion warning DSS. Informatization. 2005: 50-52
[3] http://www.goonie.cn/news/industrynews/2008/05/2008-05-03122.html
[4] Li Yanling. Security monitoring system framework and its key technology of BBS content. Research Institute of China Electronics. 2007, 2(4): 144-149
[5] Jin Yaohong. HNC language understanding technology and its applications [M]. Beijing: Science Press. 2006.
[6] http://gate.ac.uk/