A Proposal for the use of Artificial Intelligence in Spend-Analytics

Mark Bishop, Sebastian Danicic, John Howroyd and Andrew Martin

Our core team

Mark Bishop PhD studied Cybernetics and Computer Science at the University of Reading. He is Professor of Cognitive Computing at Goldsmiths, University of London, and between 2010 and 2014 was Chair of the Society for the Study of Artificial Intelligence and the Simulation of Behaviour (AISB), the largest artificial intelligence society in the United Kingdom. He has published widely in Artificial Intelligence, Machine Learning and Neural Computing.

John Howroyd PhD studied Mathematics at Oxford University and University College London. As well as being an expert mathematician, John has published widely in computer science, in particular in program analysis. He has also worked as Head of Research on a major project developing a spend-analytics system for NHS trusts. The particular problems solved for this system involved dealing with noisy and incomplete data; John and his team devised new techniques for automatically enriching this data in a structured way using external sources. He has in-depth knowledge of Bayesian networks, classification and clustering methods, and is also an experienced database engineer specialising in efficiency.

Sebastian Danicic PhD studied Pure Mathematics at Queen Mary College, London, and Computer Science at the University of Oxford and Imperial College London. He is Reader in Computer Science at Goldsmiths, University of London. He is a vastly experienced researcher with publications in Program Analysis, Theoretical Computer Science, Complexity of Algorithms and Software Watermarking. He is Director of the Program Analysis and Transformation Group at Goldsmiths.
Andrew Martin MSc studied Computer Science and Cybernetics at the University of Reading and Cognitive Computing at Goldsmiths, University of London, under Mark Bishop. He is currently a PhD student at Goldsmiths, University of London, researching Artificial Intelligence in the context of 4E Cognitive Science, a software contractor, and the current Secretary of the AISB.

Together we have broad experience of many aspects of Mathematics, Computer Science and Artificial Intelligence. We have had considerable success working together as a team, developing both new research ideas and deliverables for customers.

Background

In our document entitled "The Centre for Intelligent Data Analytics: research goals" of December 2013 we identified the following research areas where advanced Artificial Intelligence (AI) techniques can assist the delivery of medium- and long-term strategic goals for our partner's Analytics.

Semantics

At the heart of spend analysis is the general problem of forming an accurate, detailed semantic understanding of items from the raw text information that is available to the system (e.g. product descriptions). This data must be analysed using the existing knowledge base; there may, however, sometimes not be enough current context to understand this data unambiguously, in which case it may be necessary to enrich the information via additional user interaction and/or web spidering. To help solve such semantic issues there is scope for the application of new AI techniques: for example, deep learning, reservoir computing, and the newly emerging area of quantum linguistics [1].

[1] Maruyama reports: "Quantum linguistics emerged from the spirit of categorical quantum mechanics, integrating Lambek pregroup grammar, which is qualitative, and the vector space model of meaning, which is quantitative, into the one concept via the methods of category theory." It has already achieved, as well as conceptual lucidity, experimental successes in automated synonymity-related judgement tasks (such as disambiguation). For a brief introduction see Jacob Aron (2010), "Quantum links let computers understand language", New Scientist, December 2010.
Identification of similar suppliers and products

Previous work by the team has already demonstrated the need to build contextually sensitive ontologies for product descriptions. These can aid in the core classification of both products and suppliers. Improvements in this technology will lead to better identification of equivalent products; such improvements can be envisaged as applying in two distinct ways:

1. To [better] identify as the same a particular entity originally made by one manufacturer.

2. To [better] identify products that fulfil the same functional role but which are made by different manufacturers and subsequently sold on by different suppliers.

Clearly the abstract notion of equivalent functionality opens up further questions regarding the relative quality of one product as compared to another, etc. Access to our partner's huge database offers exciting new opportunities for state-of-the-art data mining and quantum linguistics to help make useful progress in this domain.

Different learning algorithms for classification based on clean data

Access to our partner's database opens up new opportunities to research state-of-the-art machine learning techniques (e.g. deep learning [2], reservoir computing, echo-state networks) which could also potentially offer a significant improvement in classification performance.

[2] Deep learning is part of a broader family of machine learning methods based on learning representations. A field (e.g. a product) can be represented in many different ways (e.g. different sentences), but some representations make it easier to learn tasks of interest (e.g. "Is this drill the same as that one?") from examples.

Automatic ontology generation

Ontologies are structural frameworks for organising information and are used in artificial intelligence as a form of knowledge representation about the world (or some part of it). An ontology formally represents knowledge as a set of concepts within a domain, using a shared vocabulary to denote the types, properties and interrelationships of those concepts. Automatically
developing contextually sensitive ontologies will significantly improve the classification system.

Trend analysis: prediction of future price fluctuations

The aim here is to explore the use of our partner's database to identify economic trends in purchasing via the application of advanced machine learning techniques; the expectation is that, with access to the huge database, new learning algorithms could be trained to make commercially useful time-series predictions (e.g. to highlight strategic opportunities for investment).

The Way Forward

Informed by the demonstration, it appears that there are two separate, but inter-linked, pathways to be developed:

1. Spend-analytics for buyers (SAB)

2. Global Spend Analytics (GSA)

GSA is an entirely new, yet-to-be-specified system. Perhaps a good way to summarise the concerns of SAB is: to provide a system for purchasing managers which allows them to find the most easily achievable savings from their data with the least effort, so maximising their return on time invested. The team highlight some of the general considerations that a SAB system might address in Appendix 1.

In the system demonstrated, the production of spend-analytics from the buyer's perspective uses only transactions between the buyer using the system and its suppliers (in each case, a tiny proportion of all the data); the rest of the vast data set is ignored in performing this analysis. We term the calculation of price variance at the individual supplier level "local spend-analytics", as it pertains only to the subset of the total data set relating to a specific supplier.

Local spend-analytics

The system demonstrated by our partner at the meeting on April 17th computed local price variance on the same product
supplied by the same supplier (a product-identity relationship). Because the data possessed by our partner is clean, product-identity is a relatively straightforward function to calculate, requiring no application of artificial intelligence methods [3]. It is clear, however, that if the current system is extended to include more general analysis, the application of AI-based techniques cannot be avoided. For example, since different suppliers may use [subtly] different text to describe the same product, a simple identity relationship between text descriptors may no longer hold; in this case we need to class as the same text strings that are [by a suitable metric] similar. We note that even the relatively simple-sounding task of comparing prices of the same product supplied to the buyer by different suppliers defines a problem whose solution is considerably more difficult than the example demonstrated.

Global spend-analytics

Global spend-analytics, on the other hand, will take advantage of the whole dataset and other external data to allow us, for example, to observe general trends, and to predict strategic risks and opportunities. Although the use of artificial intelligence can improve local spend-analytics (in the example we highlighted, by allowing the application of a similarity metric for product identification), for global spend-analytics the use of advanced AI will be essential.

Proposed improvements to spend-analytics

Although the system demonstrated on April 17th highlights an immediate and exciting potential revenue stream for our partner, we are concerned that [future] competitors could realise similar functionality relatively easily.
[3] Because of the clean nature of the data, it is likely that local product-identity can be established by a simple comparison.

In this context the team have identified the following broad research pathways by which the Analytics might be significantly, and non-trivially, improved (we expand and appropriately outline these ideas in Appendix 2); furthermore, we suggest that the use of appropriate advanced AI techniques could offer more clearly delineated intellectual property rights to our partner:

1. Real time processing
2. Better price variance analysis

3. Adding reporting dimensions

4. Knowledge enrichment from external sources

5. Clustering or classification of products

6. Improved search functionality

7. User behaviour to improve results

8. Modelling the market for better statistics

9. Trend analysis for predictive forecasting

Concluding remarks

If our partner seeks to fully monetise their data assets, more effective, deeper analytics will inevitably be required. In order to achieve this, some or all of the nine areas identified above need to be investigated (not least to develop and delineate long-term intellectual property across the domain). It is in these areas that the team would seek to apply powerful new AI methodologies to leverage strategic advantage in the medium and long term.

Appendix 1: Some considerations of spend analysis

In our experience, in the context of spend analytics, the following kinds of issues are often of concern to purchasing managers:

Price Variance
The same product bought from the same supplier at various prices. This suggests areas where contracted prices may be considered.

Supplier Consolidation
The same product bought from differing suppliers at various prices. This suggests where preferred suppliers may help to reduce overall costs.

Product Consolidation
Differing products with the same functional role bought at various prices. This has many difficulties, as it raises questions of quality and cost of utility, but could result in overall reductions in expenditure.
Order Consolidation
Products bought frequently in small amounts, where savings could be achieved by placing fewer bulk orders.

Contract Adherence
Products bought off contract when one exists, but at a higher price. This would require comparing invoice lines against a database of contracted pricing.

Order Adherence
Products supplied which were not requested. This requires matching invoices with the relevant orders, where they exist, and raising concerns with the supplier in good time.

Peripheral Cost Savings
Reducing peripheral charges such as VAT, invoicing, delivery, and credit.

Internal Cost Savings
Reducing internal expenses such as storage, stock control, cost of accounting, and delivery to point of use.

Spend Forecasting
Given current market trends, what is the likely future expenditure for the various parts of the business?

NB. It is unlikely that any spend analysis system can fully resolve these problems (as this will always require the application of problem-specific knowledge and experience from purchasing managers); however, by appropriately analysing current and past data, a strong spend analysis system can provide appropriate information to purchasing managers, from which they can make good purchasing decisions more easily.

Appendix 2: Potential improvement pathways for the Analytics

Real time processing

There is a commercial advantage to reporting in real time for spend analysis: it will allow purchasing managers to raise concerns on particular invoices before payment is made. With appropriate consideration of the information architecture this can be achieved, allowing incremental improvements to be reflected in the reporting as they arise.
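As a minimal sketch of what such incremental reporting might look like, the following Python fragment maintains running per-(supplier, product) statistics that are updated as each invoice line arrives, so that reports always reflect the latest data; the class and field names are illustrative assumptions, not part of any existing system.

```python
from collections import defaultdict


class IncrementalSpendStats:
    """Running spend statistics per (supplier, product) pair, updated
    one invoice line at a time so reports stay current (a sketch)."""

    def __init__(self):
        # (supplier, product) -> [units, total_spend, min_price, max_price]
        self.stats = defaultdict(lambda: [0, 0.0, float("inf"), float("-inf")])

    def add_invoice_line(self, supplier, product, quantity, unit_price):
        # O(1) update per incoming invoice line: no batch reprocessing needed.
        s = self.stats[(supplier, product)]
        s[0] += quantity
        s[1] += quantity * unit_price
        s[2] = min(s[2], unit_price)
        s[3] = max(s[3], unit_price)

    def report(self, supplier, product):
        units, total, lo, hi = self.stats[(supplier, product)]
        return {
            "units": units,
            "total_spend": total,
            "mean_price": total / units if units else 0.0,
            "price_range": (lo, hi),
        }
```

A report can thus be queried immediately after any invoice line is ingested, which is what would let a purchasing manager query a suspect invoice before payment is made.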
Better price variance analysis

As an example of the system's functionality we were shown how it reports potential savings to buyers on the same product supplied by the same supplier. At the prototype demonstration it appeared that the system estimates potential savings by computing how much would have been paid if all products had been bought at the minimum price (and then subtracting that amount from what was actually paid). The team remain concerned that such an approach may [at least occasionally] give rise to an exaggerated view of potential savings to buyers: for example, the data set may contain a single outlier with a much cheaper price (representing, say, a special offer) among many transactions, and it will normally be unreasonable to use this singleton as a basis for comparing all other purchases. Furthermore, the price of items may fluctuate seasonally, and it would be unreasonable to expect to pay the summer price for tomatoes in the winter. We suspect that if a SAB system merely highlighted variations from the minimum price, this feature might eventually be ignored by its users.

We suggest that for customers to take price variation seriously a more sophisticated approach is required: one that can take all of the above factors into account. In much the same way that the Google PageRank algorithm gave rise to a better reflection of the importance of specific web pages (and hence prompted the long-term shift of web search from AltaVista to Google), we believe that a similarly clever algorithm for ordering possible savings could offer a much better reflection of the importance of individual price variations to the user.

As soon as the SAB system is extended to include less specific analysis (e.g.
the task of comparing prices of an identical product supplied to the buyer by different suppliers), the application of advanced artificial intelligence techniques (from areas such as quantum linguistics, data mining, machine learning and clustering) cannot easily be avoided [4].

[4] For example, instead of simply reporting variances from minimum prices, more sophisticated algorithms could inform buyers which products were most likely to yield the largest savings (taking into account seasonal fluctuations, etc.) and offer the user the chance to ignore outliers in performing the analysis. In addition, we suggest that inflation and other market forces should also be taken into account in presenting more accurate estimates of potential savings to buyers.
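To illustrate the concern about minimum-price baselines, the sketch below contrasts the approach shown at the demonstration with a (hypothetical) more robust estimate that compares against a lower-quartile price, so that a single special-offer outlier no longer dominates; the 25% cut-off is an arbitrary illustrative choice, not a recommendation.

```python
def naive_potential_saving(prices_qtys):
    """Baseline shown in the demo: assume everything could have been
    bought at the minimum observed price."""
    lo = min(p for p, _ in prices_qtys)
    return sum((p - lo) * q for p, q in prices_qtys)


def robust_potential_saving(prices_qtys, percentile=0.25):
    """Sketch of a more defensible estimate: compare against a
    lower-quartile price rather than a possibly one-off minimum
    (e.g. a single special offer)."""
    prices = sorted(p for p, _ in prices_qtys)
    baseline = prices[int(percentile * (len(prices) - 1))]
    return sum(max(p - baseline, 0.0) * q for p, q in prices_qtys)
```

On a data set of one 1.00 special offer, eight purchases at 10.00 and one at 12.00 (one unit each), the naive estimate reports a saving of 83.00 against the 1.00 outlier, whereas the quartile baseline of 10.00 reports only 2.00; the latter is far closer to what a buyer could realistically recover.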
Allowing buyers to add reporting dimensions

It would also seem natural to allow users to influence the overall reporting of possible savings. This could include, for example, the ability to upload cost centre codes, accounting codes, their own product classifications, or contract data (agreed pricing of products from various suppliers). In this context the team suggest investigating the extent to which AI technology could be used in a predictive manner to help reduce the burden of maintaining such dimensions as new product items are supplied or new suppliers are engaged.

Knowledge enrichment from external sources

The @UKplc SpendInsight system was designed to incorporate noisy data from many different sources; for example, order lines, supplier catalogues, contract databases, accounting systems, and the web; in this respect the clean database of invoices is now the starting point. Where there is information to support the underlying data, this could also be linked. AI technologies (as deployed in the SpendInsight system) can offer mechanisms to do this in a way that keeps the data sources distinct and thus enables complete control over what is shown to specific users.

Clustering or classification of products

Clustering is essential in useful spend-analytics. The task of moving from identical to similar items is very difficult and requires a variety of techniques, many of which fall under the general heading of artificial intelligence. For example, a SAB system may be required to perform a more general analysis about pens; in order to do this, we need to find all products in our system which come under that category. This is a very hard task and can never be performed to 100% accuracy except with very small data sets. In fact, to help solve the problem we may have to look outside our local dataset, possibly even resorting to spidering the web in search of hints about how to classify products whose internal descriptions are not sufficiently helpful.
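The "suitable metric" needed to class near-identical descriptions together can be illustrated with a deliberately naive token-overlap (Jaccard) score; this is a stand-in for discussion only, and the tokenisation and threshold below are illustrative assumptions. A production system would need the richer semantic techniques discussed in this proposal.

```python
def tokens(description):
    """Crude normalisation: lower-case alphanumeric tokens only."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in description)
    return set(cleaned.split())


def jaccard_similarity(desc_a, desc_b):
    """Token-set overlap between two product descriptions: the size of
    the intersection divided by the size of the union."""
    a, b = tokens(desc_a), tokens(desc_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def same_product(desc_a, desc_b, threshold=0.6):
    # The threshold is an arbitrary illustrative choice.
    return jaccard_similarity(desc_a, desc_b) >= threshold
```

Under this score, "BIC Cristal ballpoint pen, blue" and "Bic cristal blue ballpoint pen" are classed as the same product despite differing word order and punctuation, while unrelated descriptions score near zero; real catalogue data would of course defeat anything this simple (abbreviations, pack sizes, part numbers), which is precisely why advanced AI techniques are needed.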
Without clustering, the same product supplied by a different supplier will be regarded as a different product; identifying them as the same is a very difficult problem. When are two products produced by different suppliers in fact the same? This sort of question is addressed using algorithms from artificial intelligence and can only be answered probabilistically. Furthermore, clustering is essential whenever we want to ask questions in a more general
way. Without clustering, we may be able to ask questions like "How is supplier X performing this month?", but if we want to ask questions like "How is supplier X performing this month compared to other similar suppliers?" things become much more complex. We need to find ways of clustering similar suppliers. Presumably, inter alia, similar suppliers sell similar products. Deciding whether two different products are similar, however, is an even more difficult problem than deciding whether they are identical. This problem may require the use of external data produced as the result of spidering, and state-of-the-art semantic text analysis such as quantum linguistics.

Improved search functionality

A nice feature demonstrated was the search functionality when filtering the result set by product. However, this relied purely on selecting products containing the search terms in their invoice descriptions. The system has many possibilities for improvement, but these require a degree of semantic understanding (e.g. that the word "transit" should be treated synonymously with "carriage"). The search functionality would also be improved by using enriched data from external sources, such as fuller product descriptions from supplier catalogues. Similarly, a fine-grained clustering or classification of products could be used to broaden searches over specific types of product. We see this as a series of incremental steps to provide buyers with the search functionality that they require.

User behaviour to improve results

We can also add knowledge by analysing user-supplied data and user behaviour. User-supplied data allows for a more tailored interface for the user, but also, when aggregated across all users, gives semantic information from the human perspective which may be leveraged in many ways. Similarly, user behaviour can be mined, providing an important feedback loop for the relevant learning algorithms.
For example, noting which products are most frequently grouped together for comparison gives an additional mechanism for assessing which products are similar. This can then be used to adjust the parameters for the ranking of products in a search.
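The co-comparison idea above can be sketched as a simple counting scheme: each time a user groups a set of products for side-by-side comparison, every pair in that set gains a vote, and a Dice-style score turns the counts into a similarity signal. The class and method names are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations


class CoComparisonSimilarity:
    """Mines user behaviour: products that users frequently group
    together for comparison are taken as evidence of similarity."""

    def __init__(self):
        self.pair_counts = Counter()     # times each pair was co-compared
        self.product_counts = Counter()  # times each product appeared at all

    def record_comparison(self, products):
        """One user action: a set of products compared side by side."""
        unique = sorted(set(products))
        for p in unique:
            self.product_counts[p] += 1
        for a, b in combinations(unique, 2):
            self.pair_counts[(a, b)] += 1

    def similarity(self, a, b):
        """Dice-style score: co-comparisons relative to how often each
        product appears in any comparison."""
        pair = self.pair_counts[tuple(sorted((a, b)))]
        total = self.product_counts[a] + self.product_counts[b]
        return 2.0 * pair / total if total else 0.0
```

A score like this could feed back into search ranking alongside the textual and external-data signals discussed earlier; it is behavioural evidence, so it captures similarity "from the human perspective" that no description-based metric can see.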
Modelling the market for better statistics

The data for one product item from one supplier to one buyer is generally so sparse that accurate analysis and predictions are not possible (a problem statisticians might describe as over-fitting). Of course, something is probably better than nothing, but the value will be limited, and without care expectations could be artificially raised. The notion of similarity, as discussed under "Clustering or classification of products", allows for hierarchical modelling of products. High-level groupings with many members have lots of data and smoother behaviour, giving rise to better models. Lower-level groupings have fewer members and less data, but their models should be influenced (and smoothed) by the models of the higher-level groups to which they belong. This enables better predictions to be made at these lower levels by allowing influence from above.

Trend analysis for predictive forecasting

The market modelling will enable comparative analysis of the various products and product groupings, thus building up a network of correlations over the marketplace, with strong correlations between closely related parts of the market but also some between parts which are more distant (and perhaps unexpected). Temporal properties may also be examined; for example, where growth in one is usually followed by growth in another. This, together with standard time-series techniques, should provide a rich toolbox for trend analysis and predictive forecasting.

Contact Us

If you have follow-up questions, contact Andrew Martin at a.martin@gold.ac.uk, who will answer your question directly or pass it on to the other members of the team. We look forward to receiving your enquiries.