Since 2010, FundApps has been making compliance simple by providing a client-focused service to automate monitoring of regulatory requirements. With offices in London, New York and Singapore, the company monitors over USD 10 trillion in AuM with 750+ users from compliance teams at asset managers, hedge funds, pension funds and investment banks around the world.
FundApps processes market data on behalf of its clients. Market data values often differ between issuers, which leads clients to make mistakes. To help, FundApps required an intelligent software solution that could distinguish the correct market data values from all the data collected for a given period of time and immediately detect changes reported by the clients. This would ensure reliable, high-quality data and enable clients to make data-driven decisions.
One of FundApps' services relates to the management of shareholding disclosure. The market data behind this service is not always up to date, whether because of human error or a service failure. As a result, a large percentage of the data contained invalid values for certain periods of time, and different providers reported different market data for the same company during the same time period. The goal was to detect the correct value among the many reported. We accepted the challenge and immediately started on research and prototypes to find a software solution to the problem.
Initially, we made sure we got as much data for processing as we could. The data was a time series of market data. Using tools for predictive data analysis such as scikit-learn (built on NumPy, SciPy, and matplotlib), SciPy, and pandas, we started with plotting and visual data analysis. We concluded that this type of data is best suited to unsupervised learning algorithms.
Then we proceeded with pre-processing, transformation and normalization of the data. Once that was done, we fitted the data to a range of unsupervised machine-learning clustering algorithms. Using the DBSCAN algorithm with a precomputed distance matrix, we successfully detected new clusters. By applying the Pareto principle to all clusters, we were able to accurately predict the correct daily value.
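To make the clustering step more concrete, here is a minimal sketch of DBSCAN run on a precomputed distance matrix with scikit-learn. It is an illustration only, not the production implementation: the function name cluster_reported_values, the median-based normalization, the 1% tolerance and the sample figures are all assumptions made for this example.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def cluster_reported_values(values, rel_tol=0.01, min_samples=2):
    """Cluster the values reported for one security on one day.

    Hypothetical helper: the normalization and thresholds are
    illustrative assumptions, not the settings used in the project.
    """
    x = np.asarray(values, dtype=float)

    # Normalize by the median so that eps acts as a relative tolerance.
    scale = np.median(np.abs(x)) or 1.0
    normalized = x / scale

    # Precomputed pairwise distance matrix (absolute difference).
    distances = np.abs(normalized[:, None] - normalized[None, :])

    # metric="precomputed" tells DBSCAN that the input is a distance matrix.
    model = DBSCAN(eps=rel_tol, min_samples=min_samples, metric="precomputed")
    return model.fit_predict(distances)


# Made-up example: six providers report a value for the same company.
reports = [1_000_000, 1_000_500, 999_800, 1_250_000, 1_000_200, 750_000]
labels = cluster_reported_values(reports)
print(labels.tolist())  # [0, 0, 0, -1, 0, -1]
```

In this sketch, the four reports that agree to within 1% form cluster 0, while the two isolated reports receive the label -1, which DBSCAN uses for noise.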
Our main assumption was that data can be reported incorrectly either because of human error or because the different market data sources were not in sync, so that correct values were sometimes reported with a delay of a few days or, in some cases, even longer. Therefore, by incorporating the Pareto principle (which states that roughly 80% of the effects come from 20% of the causes) when creating new clusters, we were able to accurately predict a change in value.
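One possible reading of the Pareto step can be sketched as follows. The exact rule used in the project is not spelled out here, so this is an assumption: the largest non-noise cluster is taken to hold the correct value, and it is flagged as dominant when it covers at least 80% of the clustered reports. The function name pick_daily_value, the dominance threshold and the sample data are hypothetical.

```python
from collections import Counter
from statistics import median


def pick_daily_value(values, labels, dominance=0.8):
    """Return (value, is_dominant) for one day's reports.

    Assumption for illustration: the largest non-noise cluster holds the
    correct value, and it counts as dominant when it covers at least
    `dominance` of the clustered (non-noise) reports.
    """
    # Drop reports that the clustering step marked as noise (label -1).
    clustered = [(v, lab) for v, lab in zip(values, labels) if lab != -1]
    if not clustered:
        return None, False

    # Find the cluster with the most members.
    counts = Counter(lab for _, lab in clustered)
    top_label, top_count = counts.most_common(1)[0]
    members = [v for v, lab in clustered if lab == top_label]

    # Share of clustered reports that fall into the largest cluster.
    share = top_count / len(clustered)
    return median(members), share >= dominance


# Made-up reports and cluster labels (-1 marks noise).
reports = [1_000_000, 1_000_500, 999_800, 1_250_000, 1_000_200, 750_000]
labels = [0, 0, 0, -1, 0, -1]
value, dominant = pick_daily_value(reports, labels)
print(value, dominant)  # 1000100.0 True
```

Taking the median of the dominant cluster keeps the chosen value robust to small disagreements inside that cluster. A day on which no cluster reaches the threshold could be routed to manual review instead.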
This solution has shown that, by using DBSCAN and the Pareto principle, we are able to significantly improve the detection of market data changes.