Unlocking the Value of the Deep Web: Harvesting Big Data that Google Doesn't Reach




Introduction

Every day, untold millions search the web with Google, Bing and other search engines. The volumes truly are staggering: Google alone processes more than 3.5 billion search requests every day. However, all of these searches only scratch the surface. The web is like an iceberg: the vast bulk is underwater, concealed from view. Welcome to the Deep Web.

This huge wealth of knowledge is hidden because search engines index content by following links. When they encounter search boxes, forms, or password-protected content, they simply ignore them and move on. Unless the searchable content is linked separately to surface pages to make it accessible, it is completely invisible to Google, Bing and Yahoo.

Connotate lets you tap into this enormous repository of high-value hidden content. It navigates sites intelligently, filling in the forms and search boxes you select to automatically extract the exact content you need. Its massively scalable extraction architecture easily handles hundreds of thousands of web sources in near real time, delivering targeted content that powers your business.
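
The mechanics of reaching deep-web content can be illustrated with a minimal sketch: instead of following links the way an indexing crawler does, you submit the site's own search form and parse the results page that a link-follower would never see. The form URL and the field name "q" below are hypothetical placeholders, not any particular site's interface.

```python
import urllib.parse
import urllib.request
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href attributes from anchor tags in a results page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

def search_deep_web(form_url, query):
    # POST the query the way a person would via the search box,
    # then harvest the result links from the returned page.
    data = urllib.parse.urlencode({"q": query}).encode()
    with urllib.request.urlopen(form_url, data=data, timeout=30) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    parser = LinkCollector()
    parser.feed(html)
    return parser.links
```

Running this against a real site would of course require the site's actual form action URL and field names; the point is only that the results exist behind a form submission, not behind a link.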

The Deep Web: Why Should I Care?

The web is the world's largest source of big data: data that delivers actionable intelligence, profound insights and major economic opportunities. Much of this data is open source; it is available to anyone. By harvesting and analyzing this data, you can transform your organization, vastly extending your capabilities, building innovative products, and creating profound competitive advantage.

For example, the deep web is a critical source of information for government and intelligence agencies. In fact, the ability to harvest big data from the web has been key to the Open Source Intelligence (OSINT) community's success in keeping America safe since 9/11. By monitoring a vast array of open web sources, including deep web content, Facebook pages, message boards, tweets, blogs and RSS feeds, they detect and respond to many threats, including terrorism, human trafficking and other national security concerns. Other government agencies use the web's big data to stay abreast of their regulatory environment, improve their decision making, and respond more effectively to their constituents.

However, the enormous benefits of the deep web aren't limited to government and the intelligence community. Companies that leverage deep web content gain access to a hidden wealth of information, completely unlevelling the playing field.

How Can I Benefit From the Deep Web?

Many types of organizations can exploit the power of the deep web. Here are a few examples.

Information Businesses
- Create new, high-value products and services.
- Deepen coverage and enrich content.
- Aggregate content from thousands of sites.
- Track company news and organizational changes.
- Gather comprehensive market research data.
- Compile up-to-date financial and credit profiles.

Manufacturers & Fortune 500 Companies
- Develop near real-time competitive intelligence.
- Analyze pricing across thousands of e-tailing sites.
- Track product placement on websites.
- Monitor distributors for competitors' inventory levels.
- Uncover unauthorized discounts and flash sales.
- Detect dumping and counterfeiting.

Content-Centric Software Vendors
- Make apps sticky with near real-time content.
- Dramatically increase content richness.
- Create unique value by correlating content.
- Compare data such as pricing on thousands of sites.

Law Enforcement Agencies
- Monitor forums for signs of criminal intent.
- Search federal, state and county court records.
- Track suspects' online activities.
- Interrogate watchlists and offender registries.

However, unlocking this value with traditional methods is a major challenge.

Challenge #1: The Scale Is Overwhelming

The web is growing exponentially. Just 10 years ago, there were approximately 65 million websites. Today, there are approximately 1 billion websites, and that number continues to grow dramatically. Twitter sees 500 million tweets every day, or 200 billion a year. The amount of content per site is also exploding: for example, there are 347 million LinkedIn users and 1.44 billion Facebook users, each generating unique content pages.

Exploiting this content, much of it hidden in the deep web, is a huge challenge. The sheer scale makes manual search approaches such as Google and Bing completely impractical. While manual methods may work for small, one-off projects, there is no way they can cope with tasks such as ongoing newsgathering, where thousands of websites need to be monitored for hourly changes.

Traditional automated approaches such as web scrapers and scripts also can't cope with this enormous amount of data. For instance, while web crawlers can collect complete pages or sites, they are totally indiscriminate. They take everything, leaving you with the enormous problem of identifying and extracting the content you actually need, such as a price update, new listing or name change.
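
The difference between indiscriminate crawling and targeted extraction can be sketched in a few lines: rather than storing the whole page, pull out only the one field you actually need. The class name "price" and the sample markup below are invented for illustration, not any particular site's layout.

```python
from html.parser import HTMLParser

class PriceExtractor(HTMLParser):
    """Extract only the text of elements tagged class="price"."""
    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if ("class", "price") in attrs:
            self._in_price = True

    def handle_data(self, data):
        if self._in_price:
            self.prices.append(data.strip())
            self._in_price = False

page = '<div><h1>Widget</h1><span class="price">$19.99</span></div>'
extractor = PriceExtractor()
extractor.feed(page)
print(extractor.prices)  # prints ['$19.99']
```

A crawler would hand you the whole page; the targeted extractor hands you just the price, which is what a downstream pricing feed actually consumes.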

Challenge #2: Web Content Isn't Structured

Data comes in three types: structured, semi-structured and unstructured. Structured data includes things such as databases and spreadsheets, and is neatly organized and named; for example, a database consists of tables, columns and rows. This makes it easy to analyze data and connect it across multiple sources, for example by joining database tables based on a key value.

However, the vast majority of useful information on the web is semi-structured. It lives in web pages, message boards, tweets, emails, PDFs, Word documents, and many other formats, none of which impose any rigid structure. There is nothing to identify individual pieces of data or how they relate to each other. This makes semi-structured content almost impossible to analyze. To use the content, you need to take it apart and describe the data components, creating structured data sets out of semi-structured information.

Turning semi-structured data into structured data is an enormous task. If you do it manually, it has to be done again and again for every page of content that you collect: a laborious and error-prone process. Traditional automated approaches also break down quickly; it can take many hours to code a software script to extract structured data from a single webpage.
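
A toy example makes the structuring step concrete: a pattern names the data components so that each free-text record becomes a consistent, typed row you can load into a database. The sentence format and field names here are invented purely for illustration.

```python
import re

# Named groups describe the data components we want out of the text.
LISTING = re.compile(
    r"(?P<company>.+?) \((?P<ticker>[A-Z]+)\) closed at \$(?P<price>[\d.]+)"
)

def structure(line):
    """Turn one semi-structured sentence into a structured record."""
    match = LISTING.search(line)
    if match is None:
        return None
    record = match.groupdict()
    record["price"] = float(record["price"])  # typed value, not just text
    return record

print(structure("Acme Corp (ACME) closed at $12.50 on heavy volume."))
# prints {'company': 'Acme Corp', 'ticker': 'ACME', 'price': 12.5}
```

The pain the text describes is that a pattern like this is specific to one page format: every source, and every format change, means writing and maintaining another one.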

Challenge #3: The Speed of Change

The web changes constantly. Content is often dynamic, updated in hours or even minutes. This makes it a huge challenge to keep up. Search engines and traditional tools such as web scrapers don't detect these changes; they simply give you the content from the last time the site was crawled and indexed. Looking for changes manually is an enormous task: it's expensive and error-prone. Not only that, but when web sources change very quickly, as with news sites, it is virtually impossible to identify changes in a timely manner. This has a direct impact on the value of your information products and services. For instance, if you produce custom aggregated newsfeeds, then it's not acceptable to wait a day or more for new news items.

Website formats also change on a regular basis. This creates havoc with custom extraction scripts, which usually break when there is even a minor layout change. Detecting when scripts have broken is a major undertaking, and updating code to fix them is an even bigger effort. This makes it impossible to scale data harvesting across a large number of sites. In fact, many organizations struggle as they approach 200 websites, a small fraction of the number they need to extract.
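
One common change-detection technique is to keep a fingerprint of each source's last-seen content and compare on every poll. Hashing the extracted content, rather than the raw HTML, avoids false alarms from ads and timestamps. This is a generic sketch with illustrative function names, not any vendor's actual mechanism.

```python
import hashlib

def fingerprint(content: str) -> str:
    """Compact digest of a source's extracted content."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()

def detect_change(previous: dict, url: str, content: str) -> bool:
    """Return True if the content at url changed since the last poll."""
    digest = fingerprint(content)
    changed = previous.get(url) != digest
    previous[url] = digest  # remember for the next poll
    return changed

seen = {}
detect_change(seen, "https://example.com/news", "headline v1")  # True (first sight)
detect_change(seen, "https://example.com/news", "headline v1")  # False (no change)
detect_change(seen, "https://example.com/news", "headline v2")  # True (updated)
```

Polling thousands of sources this way only tells you *that* something changed; identifying *what* changed, and doing so quickly enough for a news feed, is the harder part the text describes.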

Google Isn't Enough To Harvest the Web

Google and other search engines index billions of pages of information. That's no small task, but it's not enough to harvest web content at scale. To start with, search engines simply don't reach the deep web. They just touch the surface, leaving the vast majority of the information you need hidden from view.

Search engines also fall short when searching surface content. They return links one page at a time, rather than collecting all of the relevant and valuable content that your business needs. You're forced to go through this one click at a time, which can take more than an hour for a single search. Multiply that by thousands of searches, and the process rapidly breaks down. Worse still, search engines take a shotgun approach that generates huge amounts of irrelevant results, making it even more challenging to gather the exact content you need. The results are also often out of date: they only reflect the last time the search engine indexed a site. There is no automatic way to extract precise, up-to-date content, and certainly no easy way to monitor websites for updates.

To unlock the power of the web's open content, you need a sharpshooter approach: one that can be pointed directly at the relevant sources you care about, scales massively, looks for precise content, and detects changes, additions and deletions.

Connotate Unlocks the Deep Web

Connotate is the industry leader in massively scalable web content harvesting. Using Connotate, you can extract high-value content from tens of thousands of websites, including content hidden in the deep web. Simply identify the precise content you need using intuitive visual tools, and Connotate does the rest. There's no need to code complex scripts, or to post-process content to filter out irrelevant information.

Connotate turns unstructured web content into structured data sets. It tags data elements so that they are described consistently, and includes powerful data manipulation tools that automatically restructure data into the format you want. It normalizes content across sites, making it easy to correlate and aggregate data, and lets you automatically filter data based on specific search terms.

Connotate is designed for the web's constant change. Its unique visual approach to website content extraction uses machine learning to pinpoint the exact content you need. This approach automatically adapts to most website format changes, dramatically reducing maintenance overhead compared to scripts and other HTML parsing approaches. Connotate also detects changes in website content: simply turn on change monitoring, and Connotate sends alerts whenever a source changes, identifying what's changed right down to the character level.
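
The character-level "what changed" idea can be sketched with the standard library's difflib: compare the previous and current text and report each inserted, deleted or replaced run. This is a generic diff illustration, not Connotate's actual algorithm.

```python
import difflib

def char_changes(old: str, new: str):
    """Return (operation, old_run, new_run) tuples for every changed run."""
    matcher = difflib.SequenceMatcher(a=old, b=new)
    changes = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            changes.append((op, old[i1:i2], new[j1:j2]))
    return changes

# Only the changed characters are reported, not the whole string.
print(char_changes("Price: $19.99", "Price: $17.49"))
```

An alerting system built on this would attach the changed runs to the notification, so the consumer sees exactly which characters moved rather than re-reading the page.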

5 Steps to Scoping a Web Harvesting Project

Harvesting the deep web can create enormous value, but to maximize your benefits you need to plan ahead. It's incredibly important to scope your project before you implement. Here are 5 steps that will help you to do this.

1. Understand Your Goals
What data do you need, how will it be used, and who will use it? Will your data end up in a database, an analytics application or somewhere else? How does your data need to be structured so that you can use it effectively? How will you derive actionable insights from your data?

2. Look at Your Existing Processes
Find all of the people in your organization who are collecting web content today. Identify all of the cases where they are doing this manually. These are your top targets for automated web harvesting.

3. Identify Your Web Sources
Identify all of the different categories of web sources that you need: for example, product data from competitors, industry news, and regulatory filings. List out all of the sites that can give you this data. Drill down into each site and identify specific pages and URLs.

4. Narrow Your Scope
Don't assume you need to collect massive amounts of content very frequently. This is a natural instinct, but you may not need everything. Often, a well-selected subset of data can deliver the insights you need, especially if you just want to see what's changed.

5. Plan for the Future
Think about what information you will need in future. Basic data may be sufficient now, but you may need more detail down the road. It's often easier to configure additional data sources now and turn them on later.

Conclusion

The deep web is an incredible source of high-value content: content that can transform your data supply chain. However, search engines don't reach the deep web, and other traditional approaches such as manual extraction and custom scripts are impossibly expensive and error-prone. To unlock the value of the deep web, a different approach is needed: one that is massively scalable, automated, precise and intelligent.

Connotate is designed from the ground up for high-volume web content extraction. It navigates the deep web intelligently, filling in search boxes and forms to extract the precise content you need. Its patented visual approach vastly reduces the effort needed to extract website content at scale, and lets Connotate adapt automatically to most website format changes, dramatically reducing the cost of harvesting web content.

Connotate is trusted by leading information providers and data-centric companies around the globe. They rely on Connotate's massively scalable web harvesting to deliver the content they need to create high-value, innovative products and services. Find out how we can do the same for you. Talk to us today.

Find out more about how Connotate can unleash the power of the deep web for your business. Request a Consultation, or visit us at Connotate.com today.