Big Data: AI Solutions to Intractable Problems (Part I)

AI has become an overnight sensation. But like other overnight sensations, it has taken years to get there; 73 years since 1950, when Alan Turing described The Imitation Game, better known as the Turing Test:

“A computer would be said to be intelligent if and when a human evaluator, after a natural language conversation with the computer, would not be able to tell whether he or she was talking to another person or a machine.”

Microsoft has invested billions of dollars in OpenAI, the maker of ChatGPT, a large language model (LLM). Microsoft is embedding ChatGPT into its search engine and building and releasing “Copilots” for Word, Excel, and other software. Google announced “Bard,” its own LLM. Abnormal Technology, CrowdStrike, Egress, Riskified, and others are building AI into cybersecurity tools. Apple announced that it is embedding AI within the iPhone, iPad, and Mac. NVIDIA, which makes the chips and servers used in AI, was briefly the world’s largest company by market capitalization, leapfrogging above Microsoft and Apple, only to fall back to $2.71 trillion.

Today the relevant question is not, “Are these systems ‘intelligent’?” For the C-Suite, the question is, “What problems can we use AI to solve?” And for project managers the question is, “How do we plan and execute projects to leverage AI?”

Just as Project Managers in Information Technology have to understand networks, virtualization, and “The Cloud” in order to manage projects implementing or leveraging those technologies, and need to understand Waterfall and Agile for software engineering and infrastructure projects, we need to understand AI and Big Data in order to build and incorporate AI tools and manage Big Data projects. We don’t need to know the technical differences between Central Processing Units (CPUs)1, like the Intel Xeon and AMD EPYC, and Graphics Processing Units (GPUs)2, like the NVIDIA RTX, but we do need to know that GPUs are used in Big Data and Machine Learning systems because they are designed for parallel processing: the simultaneous execution of large numbers of simple tasks, such as rendering bitmaps or comparing data points, for example representations of faces in an image recognition system.
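
To make that parallelism concrete, here is a minimal sketch in Python using NumPy. The 128-number face “embeddings,” the data, and the similarity scoring are hypothetical, not any particular vendor’s system; the point is that the same simple arithmetic is repeated across a large batch of data points at once.

```python
import numpy as np

# Hypothetical example: each face is represented ("embedded") as a vector of
# 128 numbers. A GPU-style workload scores one query face against every
# stored face in a single vectorized operation: the same simple arithmetic
# repeated across many data points in parallel.

rng = np.random.default_rng(0)
gallery = rng.normal(size=(100_000, 128))   # 100,000 stored face embeddings (made up)
query = rng.normal(size=(128,))             # the face we are trying to match (made up)

# Normalize so that a dot product becomes a cosine-similarity score.
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)
query /= np.linalg.norm(query)

# One vectorized operation compares the query against all 100,000 faces.
scores = gallery @ query
best_match = int(np.argmax(scores))
print(f"Closest stored face: #{best_match}, similarity {scores[best_match]:.3f}")
```

On a CPU this work runs on a handful of cores; on a GPU the same comparisons are spread across thousands of simpler cores, which is why Machine Learning frameworks push exactly this kind of arithmetic onto GPUs.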

The team members need to understand how to think analytically. They need to be creative, disciplined, and flexible enough to change their focus from causation to correlation, and to recognize patterns. It helps to have an understanding of statistics and the scientific method.

The simplest correlation works beautifully. Suppose you buy a ticket to fly from New York to LA. Two hours before the flight, you pay for a taxi and then, a few minutes later, baggage fees. Subsequently, you buy a book or magazine, a sandwich, and coffee in the airport. Hours later, after the flight, you pay for a taxi in LA and check into your hotel. The credit card company has a high degree of confidence that these transactions are legitimate. It knows you were traveling from New York to LA, knows you buy stuff in airports and use taxis to travel to and from airports. But if during the flight, or an hour after the plane lands in California, your card is used to attempt to buy something else on the east coast, or in any location far from LA, the credit card company will have a high degree of confidence that those charges are fraudulent. It will deny them or contact you.
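
A toy version of that “impossible travel” correlation can be written in a few lines of Python. Everything here is invented for illustration, including the 600 mph speed threshold and the sample transactions; real card networks correlate far more signals than time and location.

```python
from dataclasses import dataclass
from math import radians, sin, cos, asin, sqrt

@dataclass
class Transaction:
    hours: float   # hours since the first charge of the trip
    lat: float     # where the card was used
    lon: float

def miles_between(a: Transaction, b: Transaction) -> float:
    """Great-circle (haversine) distance between two charges, in miles."""
    dlat = radians(b.lat - a.lat)
    dlon = radians(b.lon - a.lon)
    h = sin(dlat / 2) ** 2 + cos(radians(a.lat)) * cos(radians(b.lat)) * sin(dlon / 2) ** 2
    return 3959 * 2 * asin(sqrt(h))

def looks_fraudulent(prev: Transaction, curr: Transaction, max_mph: float = 600.0) -> bool:
    """Flag a charge if the card would have had to travel faster than a jet."""
    elapsed = max(curr.hours - prev.hours, 0.1)   # avoid division by zero
    return miles_between(prev, curr) / elapsed > max_mph

# Legitimate pattern: coffee at the New York airport, hotel in LA after the flight.
ny_coffee = Transaction(hours=0.0, lat=40.64, lon=-73.78)
la_hotel = Transaction(hours=7.5, lat=34.05, lon=-118.24)

# Suspicious pattern: an east-coast charge one hour after the LA hotel check-in.
miami_charge = Transaction(hours=8.5, lat=25.76, lon=-80.19)

print(looks_fraudulent(ny_coffee, la_hotel))     # False: 7.5 hours allows a cross-country flight
print(looks_fraudulent(la_hotel, miami_charge))  # True: roughly 2,300 miles in one hour is impossible
```

A single velocity rule like this catches only the crudest fraud; the value of the credit card company’s models comes from correlating many such signals at once.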

American Express customers currently use their 114 million cards to purchase $1.2 trillion of goods and services annually in 105 countries. That’s a tremendous amount of money, a staggering number of transactions, and an unbelievable number of opportunities for fraud. Amex was an early adopter of Big Data. In 2010, they began consolidating the knowledge gained from 150 different fraud detection models into one global model. They rolled this out in 2014 and made it their only model in 2015. Today, American Express’ fraud detection model delivers the best fraud detection and prevention statistics in the industry.

According to Anjali Dewan, VP of Risk Management at Amex:

“100% of our [fraud detection] models are AI-powered…. American Express for 13 years has come out as the lowest in the fraud space, and not by a little, but by half.” 3

Put another way, American Express has an army of virtual fraud investigators working around the globe and around the clock.

In “Big Data4,” Viktor Mayer-Schonberger and Kenneth Cukier explain how Consolidated Edison used “Big Data” to identify manholes likely to catch fire and explode, sending 300-pound manhole covers into the sky, and how the City of New York wrestled with the fact that firefighters are 15 times more likely to die fighting fires in illegal residential conversions. The correlation was obvious: prevent fires in illegal conversions and save firefighters’ lives. In “Big Data @ Work5,” Thomas Davenport describes how Netflix used “Big Data” to predict that “House of Cards” would be successful.

Davenport, Mayer-Schonberger, and Cukier describe the characteristics of “Data Scientists” – scientists who can code, hackers who understand the scientific method, who know how to develop and test a hypothesis, understand statistics, think independently, challenge assumptions, and find correlations, some of which may be non-obvious.

They distinguish “Big Data” from data warehouses and other decision support systems. The difference is not simply the volume of data but, more importantly, the fact that data warehouses are internally derived, highly structured databases. Where a data warehouse may be hundreds of gigabytes or terabytes managed in a relational database, a “Big Data” system may look at wildly unstructured data, multiply structured data, or complex data, some of which may come from external sources.

For example, telephone call records for a wireless carrier, transactions in a credit card processing system, or multiple years of accounting data would be contained in internal databases and managed by a relational database engine, an RDBMS. All the records are similar and highly structured: tuples6 of data with attributes like name, account number, phone number, address, etc. SQL, Structured Query Language, which grew out of the relational model Edgar F. Codd described at IBM in 19707, is used to query these databases, whether managed by IBM DB2, Microsoft SQL Server, Oracle, or another database engine. These are internally managed proprietary databases.
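
To show how uniform such records are, here is a hedged Python sketch using the standard library’s sqlite3 module. The table, columns, and rows are invented; the point is that every record has exactly the same structure, which is why a relational database and SQL fit this data so well.

```python
import sqlite3

# Hypothetical call-record table: every row is a tuple with the same attributes.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE call_records (
        account_number TEXT,
        caller_name    TEXT,
        phone_number   TEXT,
        called_number  TEXT,
        minutes        REAL
    )
""")
conn.executemany(
    "INSERT INTO call_records VALUES (?, ?, ?, ?, ?)",
    [
        ("A-1001", "Pat Jones", "212-555-0100", "310-555-0199", 12.5),
        ("A-1001", "Pat Jones", "212-555-0100", "617-555-0142", 3.0),
        ("A-1002", "Lee Smith", "718-555-0133", "212-555-0100", 45.2),
    ],
)

# A structured query: total minutes per account. Questions like this are easy
# precisely because every record shares the same attributes.
for account, total in conn.execute(
    "SELECT account_number, SUM(minutes) FROM call_records GROUP BY account_number"
):
    print(account, total)
```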

Big Data, on the other hand, is unstructured or partially structured and may contain data that is external to the enterprise. Medical data, for example, will contain patient data, which can be managed within relational models, but will also contain diagnostic information, in the form of images, videos, hand-written notes, and the results of various tests, and these data may come from various sources.
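
By contrast, here is a minimal sketch of what “partially structured” can look like. The fields, values, and file names are hypothetical: a single patient record that mixes relational-style attributes with free-text notes, imaging files, and externally sourced lab results that no single fixed schema captures cleanly.

```python
import json

# Hypothetical, partially structured patient record. The demographic fields would
# fit a relational table; the notes, images, and external results would not.
patient_record = {
    "patient_id": "P-20481",
    "name": "Jane Doe",
    "date_of_birth": "1968-04-12",
    "encounters": [
        {
            "date": "2024-11-03",
            "clinician_note": "Pt reports intermittent chest pain, worse on exertion.",
            "images": ["chest_xray_20241103.dcm"],      # binary imaging files
            "external_results": {                       # arrives from an outside lab
                "source": "Regional Lab Partners",
                "troponin": {"value": 0.02, "units": "ng/mL"},
            },
        }
    ],
}

# Semi-structured formats such as JSON can hold this mix, but analyzing the free-text
# notes or the images takes very different tools than an SQL query does.
print(json.dumps(patient_record, indent=2))
```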

Notes:

  1. CPUs are designed with a small number of cores, each with a relatively large amount of cache RAM, in order to run multiple applications, each with varying amounts of data, such as executing applications on a workstation or operating systems on a virtualization host. ↩︎
  2. GPUs are designed with a large number of cores, each with a relatively small amount of cache RAM, in order to simultaneously and repetitively execute the same process on a large number of data points. ↩︎
  3. “Machine Learning Helps Payment Services Detect Fraud,” AmericanExpress.com; John Koetsier, “How Amex Uses AI To Automate 8 Billion Risk Decisions (And Achieve 50% Less Fraud),” Forbes.com, Sept. 21, 2020. ↩︎
  4. Mayer-Schonberger, Viktor, and Kenneth Cukier, Big Data, © 2013 ↩︎
  5. Davenport, Thomas, Big Data @ Work, © 2014, Harvard Business School Publishing Corp. ↩︎
  6. In database theory, a tuple is a single row in a table or index of a relational database. The term comes from mathematics, where a tuple is an ordered list of mathematical objects, e.g., the integers in a relation or the coordinates of a point in a Cartesian plane. ↩︎
  7. IBM Archives, “Edgar F. Codd,” https://www.ibm.com/ibm/history/exhibits/builders/builders_codd.html ↩︎