Big Data: Strategies and Synergies
Melinda H. Connor, D.D., Ph.D., AMP, FAM
Adjunct Professor, Akamai University, Hilo, Hawaii
Science Advisor, Spirituals for the 21st Century, Georgia and Nolan Payton Archive of Sacred Music, California State University Dominguez Hills
CEO, National Foundation for Energy Healing
Melinda_Connor@mindspring.com
What are the Big Issues around Big Data?
Challenges: Quality of programming skills of the computer programmers. Level of problem definition. Level of actual problem understanding in the specific area. Correct hardware to solve the issue. Correct software to solve the issue.
Challenges, cont.: Intersection and compatibility of the hardware and software. Intersection and compatibility of the software on multiple platforms. Understanding of the end user's needs. Production of the reports in a format that the end user can understand.
Client Quote: "I don't care how your software works. I don't want to spend time with your software. I just want the data I need to run my business!"
Flip Side: A poorly trained user community that wants turnkey solutions. The wrong people making the purchasing decisions. A poorly defined understanding of the real problem they are trying to solve. Poor-quality problem reports.
Where to start...
How can you utilize the terabytes per hour that you are receiving? Define the needs as closely as possible to match the needs of the business or situation. Do data mining! There will be more that you can use. Select the correct platform to do the processing at speed. Understand all of the tools that are available: do not limit yourself to one company's tools, but do write in clauses that the software must work together or no one gets paid.
What is the most effective management of this big data? Play both ends against the middle! One end is the problem you are trying to solve. The other end is the report the end user needs. Build fast platforms that are correctly sized for the load. Limit the bottlenecks in the hardware. Have the correct people do the purchasing and use industry specialists.
SPEED, CORRECT PLATFORM, CORRECT FORM OF DATABASE, CORRECT TOOLS FOR ANALYSIS, and the CORRECT FORM OF THE REPORT
What are the most effective ways of understanding the ecological landscape of the data you are receiving? Start by understanding the types of data you are collecting. Then understand the tools available. For example: object-oriented vs. relational databases. Which do you use, and when do you use one or the other?
How do you determine new corporate strategic direction based on the data when the shape of the data itself is not clear? By defining the problem that you are trying to solve very tightly. Then you get the data which answers the questions.
How long do you keep the raw data? How much storage space do you have available, and how fast are you getting the data? What are your storage processing speeds, and how fast can you process the data that is available? Know where the bottlenecks are in the physical limitations of your hardware: for example, do you have a slow I/O handler? Know the limitations in the way your database is designed: file vs. table vs. row/column locking! What about threading? When is the OS software going to start thrashing? What about the speed of memory allocation? What are the legal requirements?
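The storage questions above reduce to simple arithmetic, sketched here in Python. The function names and all figures are hypothetical, chosen only for illustration; a real estimate would also have to account for compression, replication, and any legal retention period.

```python
def retention_hours(free_storage_tb: float, ingest_tb_per_hour: float) -> float:
    """Hours of raw data that fit in the available storage at a steady ingest rate."""
    if ingest_tb_per_hour <= 0:
        raise ValueError("ingest rate must be positive")
    return free_storage_tb / ingest_tb_per_hour


def backlog_growth_tb_per_hour(ingest_tb_per_hour: float,
                               process_tb_per_hour: float) -> float:
    """Rate at which unprocessed data piles up; zero when processing keeps up."""
    return max(0.0, ingest_tb_per_hour - process_tb_per_hour)


# Hypothetical figures: 500 TB free, 2 TB/hour arriving, 1.5 TB/hour processed.
hours = retention_hours(500.0, 2.0)               # 250.0 hours of raw data
backlog = backlog_growth_tb_per_hour(2.0, 1.5)    # 0.5 TB/hour falls behind
print(f"Storage fills in {hours:.0f} hours; backlog grows {backlog} TB/hour")
```

If the backlog figure is above zero, the processing platform is undersized for the load no matter how much storage is purchased.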
Real World Example: Internet broadcast of a science experiment: 8k users logged on to a system designed for 2400 users with different businesses. RESULT: Crashed every server in the system.
And what data will you dump? Everything you can! You will be getting more! Life/data runs in cycles. You will not hear or see the information only once. There are ways to back up the raw data and keep it for a number of years but do you REALLY need that data?
What about the limitations of the hardware of the various platforms and the network structure itself? Problem definition skills of decision makers. They do not define the needs of the business closely enough because they are not using the actual data. They do not understand how to size the volume of data properly so that the correct processing platform is selected. They do not understand what shape the final product needs to be in to be useful to the team.
Real World Example: Hospital System (50 hospitals): Wanted to have end users on PCs, so selected a PC-based system which could not handle the processing load. Decided on centralized servers without tiered support. Did not purchase enough servers. Did not distribute network load effectively. Did not provide enough training on the software to medical personnel.
Programmer Training: Issues with the training of the programmers: Many do not understand how to write the software to use the hardware most effectively, AND they do not understand the stacking, AND they do not understand how to optimize the code to make the best use of the compilers.
Use an industry specialist!
What are the most effective ways of data mining? Specialized software for the platform. Build the algorithms to determine if there are any random correspondences. Know what data you want to review. Build meta-data platforms whenever possible. Have the people doing the design and builds understand the shape of the data before they start!
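One possible reading of "determine if there are any random correspondences" is a check for whether an apparent relationship between two data series could have arisen by chance. The Python sketch below uses a permutation test for that purpose; this technique is an assumption on our part, not a method named in the talk, and the function names are hypothetical.

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (neither may be constant)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def chance_probability(xs, ys, trials=2000, seed=0):
    """Fraction of random shufflings of ys that correlate with xs at least
    as strongly as the observed pairing does."""
    rng = random.Random(seed)          # fixed seed so runs are repeatable
    observed = abs(pearson(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if abs(pearson(xs, shuffled)) >= observed:
            hits += 1
    return hits / trials


# Hypothetical series: ys depends directly on xs, so chance is an unlikely explanation.
xs = list(range(20))
ys = [2 * x + 1 for x in xs]
print(chance_probability(xs, ys))
```

A value near zero suggests the correspondence is unlikely to be random; a large value suggests the pattern may be noise and not worth building reports around.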
Real World Example: Soft Drink Company in 122 countries: Needed to understand peak load days for manufacture and distribution. The problem being addressed was concurrence: when one country would have to support the overload of another. Meta-data was critical to understanding and defining the shape of the data.
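A minimal Python sketch of the concurrence question: find the days on which more than one country is over capacity at once, since on those days one country cannot absorb another's overload. The country labels, dates, loads, and capacities are invented for illustration and do not come from the actual project.

```python
from collections import defaultdict


def peak_days(daily_load, capacity):
    """Days on which a country's load exceeds its own capacity."""
    return {day for day, load in daily_load.items() if load > capacity}


def concurrent_peaks(loads_by_country, capacity_by_country):
    """Map each day to the countries over capacity that day,
    keeping only days where two or more countries peak together."""
    over = defaultdict(list)
    for country, daily_load in loads_by_country.items():
        for day in peak_days(daily_load, capacity_by_country[country]):
            over[day].append(country)
    return {day: sorted(names) for day, names in over.items() if len(names) > 1}


# Hypothetical data for three countries over two days.
loads = {
    "A": {"2024-07-04": 120, "2024-07-05": 80},
    "B": {"2024-07-04": 95, "2024-07-05": 60},
    "C": {"2024-07-04": 40, "2024-07-05": 70},
}
capacity = {"A": 100, "B": 90, "C": 65}

print(concurrent_peaks(loads, capacity))  # {'2024-07-04': ['A', 'B']}
```

Here the capacity table plays the role of meta-data: without knowing each country's limit, the raw load figures alone do not show where the conflicts are.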
What about cross platform portability of the final product? Wolf Geiger (1992) - Data is only as good as the format in which it is presented to the person who has to use it. If it is not in a format that they can use there is no point in spending the time to do any of the processing.
Real World Example: Asked the end user to write down exactly what they wanted in the report. Asked the manager to write down exactly what they wanted in the report. Asked the computer programmer to write down exactly what the clients wanted in the report. Two of three matched. Which one did not?
Cell Phone Data: How should it be parsed? Given the volume of the data, it has to start on supercomputers, but it has to end in PC formats! Object-oriented database with full variable-length fields. Needs multi-dimensional processing: computational linguistics; analysis of word stressors; analysis of grammatical syntax; cognitive focus (topic basis); recognized vocal stress vs. topic; risk factor assignment; background noise assessment; probability analysis of each of the factors to determine further review. Data presentation tools have to use a format that is already in use, so that everyone understands where to look to find the important information. Cross-platform portability!
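The last processing step, using the probability of each factor to decide on further review, might be sketched in Python as below. The combination rule (a "noisy-OR", which assumes the factors are independent), the factor names, and the 0.5 threshold are all assumptions made for illustration; the talk does not specify how the factors are combined.

```python
def combined_risk(factor_probs):
    """Probability that at least one factor signals risk,
    assuming the factors are independent (noisy-OR)."""
    p_none = 1.0
    for name, p in factor_probs.items():
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability for {name!r} must be in [0, 1]")
        p_none *= 1.0 - p
    return 1.0 - p_none


def needs_review(factor_probs, threshold=0.5):
    """Flag a record for further review when combined risk reaches the threshold."""
    return combined_risk(factor_probs) >= threshold


# Hypothetical per-factor probabilities for one call record.
record = {"word_stressors": 0.2, "vocal_stress_vs_topic": 0.5}
print(combined_risk(record))   # about 0.6
print(needs_review(record))    # True
```

The result of this step is a single yes/no flag per record, which is the kind of output that can be handed to end users in a familiar PC format.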
Questions?
Thank you!