Soma: Linked Data Infrastructure
What is Soma? It's Big Data Candy for the Cloud. The Soma platform helps data scientists collaborate to discover and share new facts from large datasets hosted on shared infrastructure, all while lowering development and operations costs.
Meet our Customers
  Expert - See themselves as experts or an authority on a subject. Want the big picture and like easy-to-use, specialised applications with great visualisation.
  Researcher - See themselves as scientists. People with a deep academic background in maths, machine learning and modeling complex processes. Reluctant coders.
  Creative - People who see themselves as data artists. Need to explain the meaning of the data. Good generalists who can code, with a flair for the visual or the data narrative.
  Engineer - See themselves as engineers. Focused on the technical problem of managing data: how to get it, store it, and learn from it. Normally strong software developers with some O/R statistics.
Customers we support now
  Engineer - Focused on the technical problem of managing data. Normally strong software developers.
  Creative - Need to explain the meaning of the data. Good generalists who can code, with a flair for the visual or the data narrative.
  Researcher - People with a deep academic background in science, maths and machine learning. Reluctant coders.
What we deliver to customers
  Engineer - Now: Big Data cluster, container management. November: storage frameworks.
  Creative - Now: GitLab integration, web-facing applications from GitLab.
  Researcher - Now: Discovery early adopters. Early September: Discovery platform rollout.
Features: Fully operational big data station, right now
  Mesos-based cloud O/S
  Cluster of 88 CPUs and 295 GB of memory
  Distributed application scheduling
  Resource scheduling
  Container management
  DNS service discovery
Deployment
  GitLab
  Mesos cluster
  ZooKeeper cluster
  HDFS cluster
  Integrated DNS
  CI servers
  Docker registry
Deeper Dive
  GitLab - All applications MUST be in GitLab.
  Mesos cluster and container manager - Let's have a look at what is running right now:
Lambda architecture
  Mixes batch and real-time processing, so data can be processed at both batch and real-time velocity.
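To make the batch/real-time mix concrete, here is a minimal sketch of the general Lambda pattern. It is not Soma's implementation: the function names, the event shape and the counting job are all assumptions chosen for illustration. A batch view is recomputed from the full history, a speed view covers only events that arrived since the last batch run, and a query merges the two.

```python
from collections import Counter

def batch_view(events):
    """Batch layer (hypothetical): recompute counts from the full event history."""
    return Counter(e["key"] for e in events)

def speed_view(recent_events):
    """Speed layer (hypothetical): count only events seen since the last batch run."""
    return Counter(e["key"] for e in recent_events)

def query(key, batch, speed):
    """Serving layer: merge the batch and real-time views at query time."""
    # Counter returns 0 for keys it has never seen, so unknown keys are safe.
    return batch[key] + speed[key]

# Illustrative data only; the keys are placeholders, not real Soma datasets.
historical = [{"key": "a"}, {"key": "a"}, {"key": "b"}]
recent = [{"key": "a"}]

batch = batch_view(historical)
speed = speed_view(recent)
print(query("a", batch, speed))
```

The point of the split is that the batch view can be slow and exhaustive while the speed view stays small and fresh; the query sees both.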
Data sources
Features
  Source control management
  Continuous deployment
  Service monitoring
  Always-available key datasets: DBpedia, SemanticWeb Dogfood
Continuous Deployment
  1. Have a GitLab account.
  2. Ask Research Ops to add the Soma role to your project.
  3. If you are accepted, you will be guided through dockerizing your GitLab project.
  4. Once accepted, every push to your master branch will be deployed and accessible online through Soma.
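The steps above boil down to a simple rule for when a push gets deployed. The sketch below expresses that rule under stated assumptions: the `Project` record and its fields are hypothetical and do not come from GitLab's or Soma's actual API. The only fixed detail is that Git names the master branch `refs/heads/master`.

```python
from dataclasses import dataclass

@dataclass
class Project:
    """Hypothetical record of a project's onboarding state."""
    path: str
    has_soma_role: bool = False   # step 2: Research Ops added the Soma role
    dockerized: bool = False      # step 3: project has been dockerized

def should_deploy(project, ref):
    """Step 4: deploy only accepted, dockerized projects on pushes to master."""
    return (
        project.has_soma_role
        and project.dockerized
        and ref == "refs/heads/master"
    )

# Illustrative project; the path is a placeholder.
app = Project(path="group/app", has_soma_role=True, dockerized=True)
print(should_deploy(app, "refs/heads/master"))
print(should_deploy(app, "refs/heads/feature-x"))
```

Pushes to other branches, or to projects that have not completed onboarding, are simply ignored in this sketch.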
Features: Integrated Discovery platform
  SOMA Discover - A hosted discovery tool, based on the Smarter Data project, that allows exploration of data and sharing of results.
  Other internal tools such as Sig.ma and Social Lens, and other projects, to follow.
Goals for Research Ops
  Nurture a Data Engineering community at Insight with supportive experts, shared tools and best practices.
  Provide a shared analytics platform for data scientists at Insight (Soma).
  Encourage new research and engagement with the wider big data analytics research community.
Nurture
  Provide a structured approach to managing and releasing all Engineering IP (code and data) at Insight:
    Source control (Git)
    Release management
    Assistance with IP management
  Provide Quality Circles for Engineering practices: 2 groups, Data Visualisation and Big Data. Workshops to commence this month.
Provide
  Build big data infrastructure for Insight: the Soma platform.
  Support ongoing Hadoop development: Hadoop clusters, Dataspace support.
  Support ad hoc projects requiring scale: Cancer atlas.
  Provide Big Data expertise to the Linked Data group: Hadoop, Yarn, Mesos, Spark, Dataspace, Mongo and Virtuoso.
Problems being met
  High cost of research when data scales to Big Data [P1]
  Ad hoc maintenance of big data sets is expensive [P2]
  Development complexity of valuable Big Data jobs is prohibitive [P3]
  High cost of operating Big Data infrastructure [P4]
  Scarcity of hardware and lack of funds for new hardware [P5]
  Inability to maintain a core operations team [P7]
  Missed opportunities for researchers to collaborate [P6]
Soma serving our customers
  Soma Create - Serves data fresh from the source. Has large queryable datasets that are both highly available and up to date, plus a service to mash these up.
  Soma Engineer - Provides a Lambda architecture for consuming, cleaning, processing and loading the data to the data layer.
  Soma Discover - Useful blocks of processing that can be connected together using a nice GUI. Works with many datastores.
  Soma Expert - Vertical applications solving real-world problems. These apps are built by Insight's Data Researchers and Data Creatives.
The 4 kinds of Data Scientist
  Expert - See themselves as experts or an authority on a subject. Want the big picture and like easy-to-use, specialised applications with great visualisation.
  Researcher - See themselves as scientists. People with a deep academic background in maths, machine learning and modeling complex processes. Reluctant coders.
  Creative - People who see themselves as data artists. Need to explain the meaning of the data. Good generalists who can code, with a flair for the visual or the data narrative.
  Engineer - See themselves as engineers. Focused on the technical problem of managing data: how to get it, store it, and learn from it. Normally strong software developers with some O/R statistics.
Goals
  Soma to be a complete ecosystem that helps researchers deliver Big Data distributed applications.
  Showcase Insight expertise.
  Standardize best practices for linked data at big data scales.
  Deliver targeted applications and tools: tools to build complex analytics apps, and job management.
Distributed O/S (Better than cloud)
  We use Mesos-based infrastructure to provide:
    Scheduling - process execution of jobs/applications across the cluster
    Resource scheduling - the CPU/memory/storage needed by these applications
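As a rough illustration of what resource scheduling means here, the sketch below places tasks on nodes with a simple first-fit rule over CPU and memory. This is a deliberately simplified stand-in: Mesos itself works by offering resources to frameworks, and its allocator is not reproduced here. The node sizes and task names are hypothetical and do not reflect the actual layout of the Soma cluster.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    cpus: float     # free CPUs remaining on the node
    mem_gb: float   # free memory remaining on the node, in GB

@dataclass
class Task:
    name: str
    cpus: float
    mem_gb: float

def place(tasks, nodes):
    """First-fit placement: put each task on the first node that can hold it.

    Mutates the nodes' free resources and returns {task name: node name},
    with None for tasks that fit nowhere.
    """
    placement = {}
    for task in tasks:
        placement[task.name] = None
        for node in nodes:
            if node.cpus >= task.cpus and node.mem_gb >= task.mem_gb:
                node.cpus -= task.cpus
                node.mem_gb -= task.mem_gb
                placement[task.name] = node.name
                break
    return placement

# Hypothetical two-node cluster and workload.
nodes = [Node("node-1", cpus=8, mem_gb=32), Node("node-2", cpus=16, mem_gb=64)]
tasks = [Task("web", 2, 4), Task("spark-job", 12, 48), Task("too-big", 32, 128)]
result = place(tasks, nodes)
print(result)
```

A task that no single node can satisfy stays unplaced, which is the situation a shared cluster scheduler has to surface to its users.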
SOMA Discover (Data)
Where we are now
  What we have
    Soma Engineer - Standard Mesos platform - Provides a Lambda architecture for consuming, cleaning, processing and loading the data to the data layer.
    Soma Discover - Smarter Data - An interactive, expressive query tool that creates data blocks and visualisations.
  What we need help on
    Soma Expert - Pivoty - A medical index built from standard HCLS datasets that uses a Pivot Browser.
    Soma Create - The Insight Standard Dataset - A shared, queryable standard set of big-data sources.