Our Data & Methodology
Understanding the Digital World by Turning Data into Insights

Understanding Today's Digital World

SimilarWeb provides data and insights to help businesses make better decisions, identify new opportunities, and spot the latest Internet and mobile trends. This information is essential for reacting to the Internet's ever-changing environment, building high-reward, low-risk campaigns, and understanding the competitive world in which you operate.

The Digital Challenge

Today's world is run by big data and algorithms, but turning big data into useful real-world insights remains a challenge for everyone. Few organizations have the resources to measure their entire digital world, and even those that do mostly stay focused on reacting to the data. At SimilarWeb we provide the platforms that empower proactive marketers, publishers, and analysts to understand their competitive landscape, build sound strategies, and drive the future of their businesses.

SimilarWeb's marketing insights are powered by four industry-leading technical assets:

SimilarWeb Panel
SimilarWeb Crawler
ISP & Partnership Data
Direct Measurement Learning Set

SimilarWeb's competitive advantage comes from our ability to combine all these diverse resources into a full picture of our digital world.

The SimilarWeb Panel

SimilarWeb's consumer insights panel, the largest in the industry, is the foundation upon which SimilarWeb's web insights stand. While others interpolate and assume, only SimilarWeb has the reach and depth to deliver unmatched accuracy and timeliness.

Size

The SimilarWeb panel is the largest in the industry and has tens of millions of monthly users. These users generate billions of pageviews every day, adding hundreds of gigabytes of data daily to SimilarWeb's databases, which process over 35,000 requests every second.

To produce timely and accurate data about Internet usage, SimilarWeb relies on statistical sampling: it measures a representative sample of the web and then estimates actual web usage from that sample. Statistical sampling is common when working with large data sets. One popular example is the polling that estimates how voters will choose a president during an election. At SimilarWeb we apply the same principles but supplement them with multiple data sources to increase accuracy and remove bias. Our sample of the Internet population is taken from our user panel, which is the largest and most diverse in the industry.
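
The polling analogy above can be made concrete with a small sketch. It is not SimilarWeb's method, only the textbook idea of scaling a sample proportion up to a population with a normal-approximation margin of error. The function name, the panel and population sizes, and the 95% z-value are all assumptions chosen for illustration.

```python
import math

def estimate_visitors(panel_size, panel_visitors, population, z=1.96):
    """Scale a panel proportion up to a population, with a rough confidence interval."""
    p = panel_visitors / panel_size                # share of the panel that visited the site
    se = math.sqrt(p * (1 - p) / panel_size)       # standard error of that share
    low = max(0.0, p - z * se)
    high = min(1.0, p + z * se)
    return p * population, low * population, high * population

# Hypothetical numbers: 500 of 10,000 panel users visited a site,
# and the population being estimated has 1,000,000 Internet users.
estimate, low, high = estimate_visitors(
    panel_size=10_000, panel_visitors=500, population=1_000_000
)
print(f"estimated visitors: {estimate:.0f} (between {low:.0f} and {high:.0f})")
```

A larger panel shrinks the interval, which is the practical reason panel size matters for accuracy.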

Geographic Diversity

The SimilarWeb panel includes users from over 200 countries, who provide our locally representative data. Our data provides audience insights that show you how the world finds a website or app, and how users in a specific country behave as a local entity. SimilarWeb also works with Internet service providers around the globe to supplement panel data with ISP data, increasing the diversity of sources and expanding our sample size.

User Diversity

SimilarWeb's panel includes data from tens of thousands of distinct sources, each representing various demographic groupings and user characteristics. Taken together, all of our sources provide a balanced picture of Internet usage. SimilarWeb builds its panel using a number of different user acquisition strategies and is constantly growing it in response to the specialized data needs of its customers. To accurately reflect the digital world in which we live, SimilarWeb's panel includes users on desktop computers and mobile devices.

Building a User Panel

The SimilarWeb panel is based on data shared by panel participants, who receive free software in exchange for agreeing to share their anonymous aggregate traffic statistics. We collect data from desktop software, mobile apps, and browser extensions. To ensure that we have access to a wide diversity of Internet users, we collect data from tens of thousands of different software products.

User Opt In

The software that collects data for SimilarWeb is either built in-house by the SimilarWeb team or, in most cases, connected to us via partnerships. In either case we have a strict two-step notification policy to inform users that their data will be collected. First, the software or extension description must explicitly state that data will be collected and the type of data being collected. Second, before the user agrees to install or add the application or extension, they must clearly see a warning that their web usage data will be collected.

Anonymous Aggregate Statistics

At SimilarWeb we NEVER track personally identifying information such as usernames, passwords, retail transactions, or any other information that can be used to identify an individual. To measure the Internet we track website and mobile app traffic data only. Put simply, we only care about data on websites and mobile apps, NOT USERS. We aggregate the data per property to give us a picture of what is happening in the digital world.
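
Aggregating per property, as described above, can be pictured with a short sketch. The record layout and the field names `source` and `property` are hypothetical and only illustrate the idea that the output keeps a count per website or app and nothing else.

```python
from collections import Counter

def aggregate_by_property(events):
    """Count visits per website or app, keeping no other field from the raw events."""
    return Counter(event["property"] for event in events)

# Hypothetical clickstream records; the field names are illustrative only.
events = [
    {"source": "extension-a", "property": "news.example"},
    {"source": "extension-a", "property": "shop.example"},
    {"source": "mobile-app-b", "property": "news.example"},
]
visits = aggregate_by_property(events)
print(visits.most_common())
```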

ISP & Partner Data

SimilarWeb also works with Internet service providers and similar partners around the globe to supplement panel data with their data, improving the diversity and quality of our sources and expanding our sample size. As with the data that we collect directly, all of our partners must adhere to our strict standards, under which no personally identifying information is passed to SimilarWeb; the only data that they collect on our behalf is clickstream or app data.

From Raw to Refined Data

The data presented at SimilarWeb.com and in the SimilarWeb PRO platform is not the raw data that is collected. It is refined data that has been extrapolated from our user panel and cleaned to remove bias and outliers.

The Algorithm

The number of estimated visits for a given site is calculated using a sophisticated Bayesian estimation algorithm, the results of which are propagated to the six traffic verticals that SimilarWeb displays: direct, referrals, mail, search, display, and social. SimilarWeb calculates visits at a daily resolution, allowing for fine-grained analysis. With SimilarWeb's daily data, you can see the periodicity of web traffic and measure traffic jumps from marketing campaigns, media coverage, or new product releases.

Direct Measurement

While it is nearly impossible to track the entirety of Internet traffic, it is possible to track traffic on individual sites with a degree of precision. To improve the accuracy of our estimation calculations, we maintain a learning set of websites that share their directly measured web traffic with us. By comparing our estimations to their precise data, we can measure the accuracy of our algorithms and make adjustments to our final refined data. We also use the learning set of directly measured Internet data to understand the quality and diversity of our data sources and how representative our panel is of the larger Internet population.
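
The two ideas in this section can be sketched in a few lines. The source does not describe its Bayesian algorithm, so the first function below uses a generic Beta-Binomial posterior mean purely as an example of Bayesian estimation; the second compares estimates against directly measured traffic and derives a correction factor as a median ratio. The model choice, the function names, the prior, and all numbers are assumptions, not SimilarWeb's implementation.

```python
from statistics import median

def posterior_visit_rate(panel_visits, panel_size, prior_alpha=1.0, prior_beta=1.0):
    """Posterior mean of a Beta-Binomial model for the share of panel users visiting a site."""
    return (prior_alpha + panel_visits) / (prior_alpha + prior_beta + panel_size)

def calibration_factor(estimated, measured):
    """Median ratio of directly measured to estimated visits across learning-set sites."""
    ratios = [
        measured[site] / estimated[site]
        for site in estimated
        if site in measured and estimated[site] > 0
    ]
    return median(ratios)

# Hypothetical panel observation: 48 of 998 panel users visited the site.
rate = posterior_visit_rate(panel_visits=48, panel_size=998)

# Hypothetical learning set: estimated visits versus directly measured visits.
estimated = {"a.example": 1000, "b.example": 2000, "c.example": 500}
measured = {"a.example": 1100, "b.example": 2400, "c.example": 500}
factor = calibration_factor(estimated, measured)

print(f"posterior visit rate: {rate:.3f}, calibration factor: {factor:.2f}")
```

The median is used here rather than the mean so that one badly estimated site does not dominate the correction.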

Accuracy

SimilarWeb's commitment to accuracy starts with its representative user panel, the largest and most comprehensive in the industry. SimilarWeb's sources of data are carefully curated and monitored for biases. SimilarWeb then cleans and screens its raw data to reduce noise and remove unrepresentative samples from the calculation. SimilarWeb also excludes inactive users and penalizes users that show abnormally high usage rates, minimizing the influence of data sources that are statistically found to be outliers for a given site. Finally, the traffic estimation algorithm assigns a weight to each source, resulting in a highly accurate visit calculation for each site.
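
The cleaning and weighting steps above can be sketched as follows. The source does not say how heavy users are penalized or how weights are chosen, so this example assumes a simple cap on per-user visit counts and hand-picked source weights; both are stand-ins for illustration.

```python
def clean_user_counts(counts, cap):
    """Drop inactive users (zero visits) and cap abnormally heavy users at `cap`."""
    return [min(count, cap) for count in counts if count > 0]

def weighted_visits_per_user(sources):
    """Weighted mean of per-source average visits; `sources` maps name -> (weight, counts)."""
    total_weight = sum(weight for weight, _ in sources.values())
    weighted_sum = sum(
        weight * (sum(counts) / len(counts)) for weight, counts in sources.values()
    )
    return weighted_sum / total_weight

# Hypothetical per-user visit counts from two data sources, with assumed weights.
raw = {
    "source-a": (2.0, [0, 1, 2, 300]),   # one inactive user and one extreme outlier
    "source-b": (1.0, [3, 3, 0]),
}
cleaned = {
    name: (weight, clean_user_counts(counts, cap=10))
    for name, (weight, counts) in raw.items()
}
result = weighted_visits_per_user(cleaned)
print(f"weighted visits per active user: {result:.3f}")
```

Without the cap, the single user with 300 visits would pull the first source's average from about 4.3 up to 101.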

Mapping the Internet with the SimilarWeb Crawler

In order to measure the Internet, we must map the different properties within it to get a picture of what users are doing online.

SimilarWeb Crawler

SimilarWeb's Internet crawler supplements the information collected by the panel and analyzes over 1 billion pages a month, supplying input data for SimilarWeb's sophisticated similarity, category, and content analysis engines. The data SimilarWeb collects from crawling the Internet powers a number of advanced analysis engines. SimilarWeb doesn't just report on traffic metrics, but understands and learns from Internet content. SimilarWeb's similarity, categorization, and tagging algorithms provide sophisticated ways to filter and analyze websites based on categories, industries, and geographies.

SimilarWeb's Content Analysis

The data SimilarWeb crawls from the Internet powers a number of advanced Internet analysis engines. All of SimilarWeb's content analysis systems are accessible by API and power a number of mission-critical adult content detection, categorization, and semantic analysis applications.

Similarity

The ability to return a list of related websites for a specific website or topic was SimilarWeb's first breakthrough. Search engines return websites based on a specific query, but will not recognize that users who visit NBA.com would also find relevant content on ESPN's website. The similarity system is based on a number of inputs, including website structure, link analysis, user surfing behavior, and a large community of user rankings.
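
One of the inputs listed above, user surfing behavior, lends itself to a small sketch. The example below ranks sites by the Jaccard overlap of their audiences. It covers only that single signal, uses made-up site names and anonymous session numbers, and is an illustration rather than the actual similarity system.

```python
def jaccard(a, b):
    """Overlap between two audiences: shared sessions divided by all sessions."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def most_similar(site, audiences, top_n=3):
    """Rank every other site by audience overlap with `site`, highest first."""
    scores = [
        (other, jaccard(audiences[site], audiences[other]))
        for other in audiences
        if other != site
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]

# Hypothetical audiences: each site maps to the anonymous sessions that visited it.
audiences = {
    "basketball.example": {1, 2, 3, 4},
    "sportsnews.example": {2, 3, 4, 5},
    "recipes.example": {6, 7},
}
ranked = most_similar("basketball.example", audiences)
print(ranked)
```

A production system would combine several such signals, for example link structure and user rankings, instead of relying on audience overlap alone.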

Categories and Tags

The categorization algorithm can accurately classify an unknown website into one of 25 main categories and 219 sub-categories. The ability to algorithmically generate categories for a given list of websites is enormously powerful and can be used for lead generation, marketing segmentation, and online filtering.

SimilarWeb's categorization engine uses a multi-class learning algorithm to generate a category for a given website from website content tags, similarity results, and a learning set of 2.5M categorized websites. SimilarWeb's categorization results are constantly improving through machine learning and the incorporation of customer input. Category results are rigorously cross-validated and tested.

The tagging system returns a list of tags that best describe the content of a given site or app. The tagging system is used as an input for SimilarWeb's similarity and categorization engines, but is also useful for categorization tasks that require open-ended and dynamic results.
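
To make the idea of multi-class classification from tags concrete, here is a minimal sketch using a naive Bayes classifier with add-one smoothing. The source does not name the algorithm it uses, so naive Bayes is a swapped-in technique, and the tags, category labels, and four-site learning set are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train(learning_set):
    """Count tags per category and sites per category from (tags, category) pairs."""
    tag_counts = defaultdict(Counter)
    site_counts = Counter()
    for tags, category in learning_set:
        tag_counts[category].update(tags)
        site_counts[category] += 1
    return tag_counts, site_counts

def classify(tags, tag_counts, site_counts):
    """Return the category with the highest naive Bayes log-score for the given tags."""
    vocabulary = {tag for counts in tag_counts.values() for tag in counts}
    total_sites = sum(site_counts.values())
    best, best_score = None, -math.inf
    for category, counts in tag_counts.items():
        score = math.log(site_counts[category] / total_sites)   # category prior
        denominator = sum(counts.values()) + len(vocabulary)    # add-one smoothing
        for tag in tags:
            score += math.log((counts[tag] + 1) / denominator)
        if score > best_score:
            best, best_score = category, score
    return best

# Hypothetical learning set of already categorized sites.
learning_set = [
    (["basketball", "scores"], "Sports"),
    (["football", "scores"], "Sports"),
    (["recipes", "cooking"], "Food"),
    (["baking", "recipes"], "Food"),
]
tag_counts, site_counts = train(learning_set)
print(classify(["scores", "basketball"], tag_counts, site_counts))
print(classify(["recipes"], tag_counts, site_counts))
```

Smoothing keeps a tag that never appeared in a category from driving that category's score to negative infinity.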

The Hardware Powering SimilarWeb's Big Data Infrastructure

A staggering amount of data is required to compute traffic and engagement metrics for a constantly changing and ever-growing Internet. Many companies say they work with big data, but SimilarWeb's 1.3-petabyte operation truly embodies the term. Over 250 servers work around the clock running SimilarWeb's machine learning algorithms and visit rate calculations. SimilarWeb's server infrastructure is built around redundancy and delivers high degrees of uptime and availability. Internet marketing never sleeps. Midnight in New York is the middle of the day in Hong Kong, and SimilarWeb is committed to reliably serving web insights around the clock.