Jane! Your Machine Learning Model Has Ruined Me
Meet Jane
Jane Doxney is a young, dynamic, brilliant, and intellectually sound individual who rose quickly from machine learning enthusiast to seasoned machine learning expert. In no time, she became a star of internationally recognized competitions and received several awards and accolades for client behavior, satisfaction, and intention modelling.
Jane and BrownTech
Jane caught the eye of Brown Foxy, an investor and entrepreneur who was keen on investing in the next big thing. Foxy had redirected all his investments into his new start-up, BrownTech, and engaged Jane to help lead its data-driven arm. It was all rosy from the word "go" for both the company and Jane. Jane's department quickly established itself as a dominant force within the organization, and the company looked to it for every single business and investment decision.
Jane’s Machine Learning Prediction
In the last quarter of the first year, BrownTech procured a consumer-focused big data set from a data broker to help it understand and predict consumer behavior ahead of its next big move. Jane modelled the data and predicted a sudden surge in customer activity that would result in increasing demand for services. BrownTech, as a data-driven organization, leveraged this prediction by making a huge investment based on it.
The Outcome
The outcome of this move was devastating. In just two days, the company had lost seventy-five percent of its investment capital and was struggling to survive. Within a few weeks, it had downsized and closed some of its key departments. The aftermath was debilitating for Foxy; he was hospitalized multiple times from the shock of losing nearly all of his investment in the twinkle of an eye.
Jane’s Reflection
For Jane, it was a time of deep reflection, a time to question everything that had ever been. She wanted to identify what had gone wrong, so she commenced an inquiry and brought in Bob Dudney to serve as an independent party. They began by double-checking the modelling techniques and processes and found them sound. On deeper investigation, Jane and Dudney realized that the data she had modelled for the investment decision was flawed in every respect of quality; it violated most of the fundamental elements of good-quality data. It was a moment of lost and found truth for Jane, the moment she realized her knowledge of data and modelling had been skewed: heavily towards data modelling, and little or nothing towards the data itself. She realized she had been poorly schooled in the subject of data quality systems and every concept relating to them.
Summary of Jane’s Story
This is the story of Jane Doxney, who made the right model prediction on the wrong data.
Why We Should Care About Jane’s Story
OK, never mind what happened to Jane; it is all fictional and not based on a true story. However, the predicament of Jane and Foxy is a true reflection of the impact of data quality on investments, the economy, and business. Most significantly, it reflects the widening gap between the importance placed on data modelling and the importance placed on data quality. The euphoria around the emergence of data science, machine learning, and artificial intelligence is focused mainly on data use and modelling, with far less attention paid to data quality and the set-up of data quality assurance systems.
In this era of the data economy (the production and sale of huge volumes of anonymized data by banks, telecommunication companies, social media platforms, and others to middlemen and organizations), start-ups and established companies alike rush to secure, analyze, and model the latest big data for their next big investment decisions and actions, neglecting to evaluate that data for reliability, integrity, and timeliness. This is reflected in the words of Mike Davie [2]:
“Middlemen and data aggregators buy and sell all this data on data marketplaces. But such high volumes of data are being bought and sold every day, going through multiple levels and exchanging hands so many times that it can become hard to ensure which data is original and which has been tampered with along the way.”
This problem calls for urgent attention: data quality needs far greater visibility in the current scheme of things, and quickly. The importance of data quality cannot be overemphasized. Poor-quality data has been estimated to cost the U.S. economy 3.1 trillion dollars, far more than the total value of the African economy as a whole, and, annually, it accounts for 9.7 million dollars in losses to organizations [1]. Good data quality, on the other hand, has been found to increase companies' revenue by about 70%.
How To Start With Ensuring Data Quality
Making data quality systems feasible in this era of big data requires including data quality in all conversations, consistently and intentionally. At a minimum, it must start with Data Quality Assurance (DQA). DQA is the systematic process of assessing and evaluating the credibility, reliability, and overall quality of data. It involves ensuring that the following fundamental data quality elements are embedded into any data quality system set-up initiative (a minimal code sketch of such checks follows the list):
1. Accuracy: This element of data quality evaluates the data and asks the question: 'Does this data correctly and truly describe the real-world situation it is meant to reflect?' Data accuracy is a measure of the degree to which the data is a true representation of the event, demographics, etc. it is meant to reflect.
2. Timeliness: This element evaluates the data and asks the question: 'Is this data time-relevant, and does it capture the real situation of the time?' Timeliness focuses on ensuring that the data is used at the right time and is not outdated. It ensures the data is in tune with current realities and reflects dynamics that change over time.
3. Precision: This element evaluates the data and asks the question: 'Does this data have adequate detail about the situation being described?' Precision is a measure of the degree to which data can be disaggregated to the lowest micro-level of use. For example, high precision means being able to disaggregate contraceptive-use data by gender (male and female), age, marital status, etc. A precision evaluation must consider the disaggregations relevant to the investment and decision goals.
4. Completeness: This element evaluates the data and asks the question: 'Does this data capture all the units, persons, facilities, countries, etc. it was meant to capture or cover?' It is a measure of the scope and coverage of the data against the set target. For example, if it is meant to capture all 54 countries of Africa, does it have data for all of them or only some? The completeness of data determines its power to generalize.
5. Reliability: This element evaluates the data and asks the question: 'Was this data obtained using standard procedures and tools, and does it follow laid-down data collection protocols, instructions, and rules?' A reliability test must include evaluating the procedures laid down for addressing data quality issues. This element is also referred to as consistency.
6. Integrity: This element evaluates the data and asks the question: 'Has this data been deliberately tampered with for personal, economic, financial, or political gain? Does it portray any bias caused by human manipulation?' This element assesses the data and ensures it is free of deliberate contamination. It is most relevant during the process of procuring big data for use.
7. Confidentiality: This element evaluates the data and asks the question: 'Are there procedures or protocols in place to ensure data security and protection, so that data is not disclosed inappropriately?' This element is most relevant when data contains personalized and highly sensitive information, such as names, bank account details, etc.
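To make these elements concrete, here is a minimal, illustrative sketch (in Python, using pandas) of how a few of them, completeness, timeliness, accuracy, and precision, might be checked programmatically before data is handed to a model. The column names (country, report_date, gender, age_group, usage_rate), the coverage target, and the 90-day timeliness threshold are all hypothetical assumptions for the example; adapt them to your own data set and quality standards.

```python
from datetime import timedelta

import pandas as pd

# Hypothetical expectations for this data set; adjust to your own context.
EXPECTED_COUNTRIES = {"Nigeria", "Ghana", "Kenya"}   # completeness target
MAX_AGE = timedelta(days=90)                         # timeliness cut-off
REQUIRED_BREAKDOWNS = {"gender", "age_group"}        # precision requirement


def run_dqa_checks(df: pd.DataFrame) -> dict:
    """Return a dictionary of simple data quality indicators for the data set."""
    report = {}

    # Completeness: does the data cover every unit it was meant to cover?
    covered = set(df["country"].unique())
    report["completeness_ok"] = EXPECTED_COUNTRIES.issubset(covered)
    report["missing_countries"] = sorted(EXPECTED_COUNTRIES - covered)

    # Timeliness: is the most recent record still within the acceptable window?
    latest = pd.to_datetime(df["report_date"]).max()
    report["timeliness_ok"] = bool(pd.Timestamp.now() - latest <= MAX_AGE)

    # Accuracy (plausibility): are values within the physically possible range?
    report["accuracy_ok"] = bool(df["usage_rate"].between(0, 100).all())

    # Precision: can the data be disaggregated by the dimensions the decision needs?
    report["precision_ok"] = REQUIRED_BREAKDOWNS.issubset(df.columns)

    # Reliability / integrity proxies: duplicates and missing values are red flags.
    report["duplicate_rows"] = int(df.duplicated().sum())
    report["missing_value_fraction"] = float(df.isna().mean().mean())

    return report


if __name__ == "__main__":
    # Tiny synthetic sample purely to show the checks running end to end.
    sample = pd.DataFrame(
        {
            "country": ["Nigeria", "Ghana", "Kenya", "Kenya"],
            "report_date": ["2024-01-10", "2024-01-12", "2024-01-15", "2024-01-15"],
            "gender": ["F", "M", "F", "F"],
            "age_group": ["18-24", "25-34", "18-24", "18-24"],
            "usage_rate": [41.0, 37.5, 52.3, 52.3],
        }
    )
    print(run_dqa_checks(sample))
```

Nothing more sophisticated than checks of this kind, run before the data ever reached the model, would have flagged the problems in Jane's data set.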
That's it! I hope you find this useful. Please drop your comments and follow me on LinkedIn: Ayobami Akiode LinkedIn.
References
[1]: Geoff Zeiss. (May 7, 2013). Estimating the economic and financial impact of poor data quality. https://geospatial.blogs.com/geospatial/2013/05/estimating-the-economic-impact-of-poor-data-quality.html
[2]: Mike Davie (April 15, 2019). Why Bad Data Could Cost Entrepreneurs Millions. https://www.entrepreneur.com/article/332238