Big Data: AI Solutions to Intractable Problems (Part II)

Classic Big Data Analytics: Exploding Manholes and Fires in Illegal Conversions

Exploding Manholes

Consolidated Edison, “Con Ed,” an investor-owned utility in New York City, has about 250,000 manholes in the City, roughly 51,000 of which are in Manhattan. The manhole covers are cast iron, a few inches thick and 24 inches in diameter, and each weighs up to 300 pounds. Every year a few hundred catch fire, and some explode up into the air. In 2007, Con Ed went to the statisticians at Columbia University and asked them to identify which manholes were most likely to catch fire and explode, hoping to find patterns that would let it manage the problem.

Fires in Illegal Conversions

An illegal conversion is a one-family house or apartment carved up to house three or four families, or a community of 12 to 20 people. As noted in Part I, New York City firefighters are 15 times more likely to die in fires in illegal conversions than in other types of residences. The FDNY knew that it could save lives by finding illegal conversions. But how? Relying on gut feelings and instinct, inspectors had a 13% success rate in finding them.

Finding the Needles in the Haystacks

Cynthia Rudin, a statistician turned data scientist, led the Columbia team working with Con Ed. They looked at “trouble tickets” and found a mess. The term “service box” appeared in 38 different variants, including “S,” “SB,” “S/B,” “S.B.,” “SBX,” “S Bx,” “S Box,” “Serv Box,” “SERV/BOX,” and “Service Box.” Database application designers should enforce quality control when and where data are entered into the database. With 38 different terms for the same data point, Con Ed’s data had no integrity.
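
Much of the team’s work was therefore ordinary data cleaning. Here is a minimal sketch of the idea – the pattern and the canonical label below are our own illustration, not Con Ed’s actual cleanup rules – collapsing the variant spellings into a single term before any analysis:

```python
import re

# Hypothetical cleanup rule: collapse the many spellings of "service box"
# found in trouble tickets into one canonical token before analysis.
SERVICE_BOX_PATTERN = re.compile(
    r"^\s*(s|sb|s/b|s\.b\.|sbx|s\s*bx|s\s*box|serv\s*/?\s*box|service\s*box)\s*$",
    re.IGNORECASE,
)

def normalize_structure_type(raw: str) -> str:
    """Map a free-text structure type from a trouble ticket to a canonical label."""
    if SERVICE_BOX_PATTERN.match(raw):
        return "SERVICE_BOX"
    return raw.strip().upper()  # pass anything unrecognized through, standardized

if __name__ == "__main__":
    for variant in ["S/B", "Serv Box", "SERV/BOX", "Service Box", "Manhole"]:
        print(variant, "->", normalize_structure_type(variant))
```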

But Cynthia and her team found 106 data points that they believed were reasonable predictors of a major manhole disaster, and they condensed the list to a smaller set of the strongest signals. By 2009 they were ready to predict problem spots. The top 10% of manholes on their list accounted for 44% of the manholes that went on to have severe incidents.
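
That “top 10% captured 44%” figure is the kind of lift you can only measure once every manhole has a risk score. Here is a minimal sketch of the evaluation step, with randomly generated scores and outcomes standing in for the team’s actual model and data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: one risk score per manhole, plus a flag for whether a
# severe incident actually occurred afterward.
n_manholes = 51_000
risk_score = rng.random(n_manholes)            # stand-in for a model's output
had_incident = rng.random(n_manholes) < 0.002  # stand-in for observed outcomes

# Rank manholes from highest to lowest predicted risk and take the top 10%.
order = np.argsort(-risk_score)
top_decile = order[: n_manholes // 10]

# What share of all severe incidents fell inside that top decile?
captured = had_incident[top_decile].sum() / max(had_incident.sum(), 1)
print(f"Top 10% of ranked manholes captured {captured:.0%} of incidents")
```

With random scores the top decile captures about 10% of incidents, which is exactly the baseline a real ranking model has to beat.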

Mike Flowers, hired by Mayor Bloomberg as New York City’s first head of data analytics, tackled the problem of tracking down illegal conversions. He and his team looked at records from the Department of Buildings, Housing Preservation, tax records, and the NYPD. They looked at 911 calls, hospitalization data, and reports of rodent infestations. They looked at construction and renovation permits and at Buildings violations, because permits indicate careful, diligent property owners and violations indicate problems. It took two years, but by 2011 they were ready. The methods were inexact, but the amount of data – covering residential properties across New York City – compensated for the imperfections. The success rate in finding illegal conversions went from 13% to 70%.

A House of Cards

In its early days, Netflix subscribers would add the titles of films they wanted to watch to a queue, and Netflix would mail each subscriber up to three DVDs or Blu-ray discs at a time. After watching the movies, subscribers would mail the discs back, and Netflix would mail the next set of three.

Given the state of the art back then, we can assume that Netflix used a relational database such as DB2, MS SQL Server, or Oracle, and that the database contained information about subscribers and about the stock of DVDs. The subscriber information was likely name, address, phone number, email, films watched, films out for watching, and films they wanted to watch. The information on movies may have been title, stars, director, genre, year made, number of copies on hand, and number of copies out with subscribers. After a while they probably started mining the database to determine the most popular films and viewing trends.
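
As a minimal sketch of what such a schema might have looked like – the table and column names are ours, not Netflix’s, and SQLite stands in for whatever RDBMS they actually ran – including the kind of “most popular films” query mentioned above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE subscriber (
    subscriber_id INTEGER PRIMARY KEY,
    name TEXT, address TEXT, phone TEXT, email TEXT
);
CREATE TABLE film (
    film_id INTEGER PRIMARY KEY,
    title TEXT, director TEXT, genre TEXT, year INTEGER,
    copies_on_hand INTEGER, copies_with_subscribers INTEGER
);
CREATE TABLE rental (  -- one row per disc mailed to a subscriber
    subscriber_id INTEGER REFERENCES subscriber(subscriber_id),
    film_id INTEGER REFERENCES film(film_id),
    shipped_on TEXT, returned_on TEXT
);
""")

# The kind of "mining" described above: which titles are rented most often?
most_popular = conn.execute("""
    SELECT f.title, COUNT(*) AS rentals
    FROM rental r JOIN film f ON f.film_id = r.film_id
    GROUP BY f.title
    ORDER BY rentals DESC
    LIMIT 10
""").fetchall()
print(most_popular)
```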

In 2007, Netflix started streaming with its “Watch Now” service. To optimize this service it needed to store multiple copies of each movie in data centers near its subscribers. Today Netflix is estimated to store 1,100 to 1,200 replicas[1] of each film across its data centers, in different formats and resolutions, such as 4K and 1080p, in order to stream to different screens – TVs, computers, tablets, and phones – around the world.

In 2011, Netflix took a big leap forward by leveraging information about the films and television shows subscribers watched to suggest what they might want to watch, asking questions like:

If Joe likes “Terminator” and “Rambo,” would he like “Rocky”?

If Bob likes “Spy Game” and “Lara Croft,” would he like “Mr. and Mrs. Smith”?

And how can we make money producing our own films?

Based on the correlated viewing habits of its subscribers, Netflix determined that ninety percent of the 12 million people who liked the film “The American President” and the television series “The West Wing” would watch one episode of “House of Cards” and, if it was compelling, most would watch additional episodes. On February 1, 2013, Netflix premiered “House of Cards.” The analysis was correct.
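
At its simplest, a question like “if Joe likes Terminator and Rambo, would he like Rocky?” is co-occurrence counting: which titles tend to be watched by the same people? Here is a minimal sketch of that idea, using a tiny hypothetical viewing log; it illustrates the correlation principle, not Netflix’s actual recommender:

```python
from collections import Counter
from itertools import combinations

# Hypothetical viewing history: subscriber -> set of titles watched.
history = {
    "joe":  {"Terminator", "Rambo", "Rocky"},
    "bob":  {"Spy Game", "Lara Croft", "Mr. and Mrs. Smith"},
    "ann":  {"Terminator", "Rocky"},
    "dina": {"Rambo", "Rocky", "Spy Game"},
}

# Count how often each pair of titles is watched by the same subscriber.
pair_counts = Counter()
for titles in history.values():
    for a, b in combinations(sorted(titles), 2):
        pair_counts[(a, b)] += 1

def recommend(title: str, top_n: int = 3) -> list[str]:
    """Titles most often co-watched with the given title (correlation, not causation)."""
    scores = Counter()
    for (a, b), count in pair_counts.items():
        if a == title:
            scores[b] += count
        elif b == title:
            scores[a] += count
    return [t for t, _ in scores.most_common(top_n)]

print(recommend("Terminator"))  # ['Rocky', 'Rambo']
```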

Search, Retail, and Social Media

Unlike AltaVista, Ask Jeeves, Yahoo, and its other competitors, Google approached Internet search from a data scientist’s perspective. Google’s engineers reasoned that just as the value of a scientific paper is reflected in the number of other papers that cite it, the popularity of a website is reflected in the number of other websites that link to it. This worked. However, an unintended consequence paved the way for exaggerated, fictitious, or sensationalized stories designed to influence elections and political referenda: there are likely to be more searches for “Clinton and Monica” than for “Clinton leads NATO against ethnic cleansing in Yugoslavia.”
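
The underlying intuition – a page matters as much as the links pointing at it – can be sketched in a few lines. The toy link graph below is hypothetical, and a plain inbound-link count is only a simplification of PageRank, which also weights each link by the importance of the page it comes from:

```python
from collections import Counter

# Hypothetical web: each page lists the pages it links to.
links = {
    "news-site.example":  ["scandal-story.example", "policy-story.example"],
    "blog-a.example":     ["scandal-story.example"],
    "blog-b.example":     ["scandal-story.example"],
    "think-tank.example": ["policy-story.example"],
}

# Rank pages by how many other pages link to them.
inbound = Counter(target for targets in links.values() for target in targets)
for page, count in inbound.most_common():
    print(f"{page}: {count} inbound links")
```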

Other examples of Big Data include Amazon using information about what people are looking for, purchasing, or streaming to recommend things they might want to buy or stream. Facebook and LinkedIn use information such as where users live and work, where they studied, and, of course, who they know, to identify other people they might know.

Correlating All These Data

While American Express probably evaluates terabytes[2] of data each year, Cynthia Rudin at Columbia and Mike Flowers at the City of New York probably worked with much less data. However, they worked with multiple collections of data from different sources – data that were not neatly organized in the tables of rows and columns of relational databases managed by DB2, Oracle, or SQL Server.

However, Rudin and her team didn’t do a classical statistical analysis. They didn’t analyze a 1% or 2% sample of the 51,000 manholes in Manhattan or the 250,000 manholes in the five boroughs. They looked at the entire dataset. The sample size was 100% – the sample was the population, or, as Mayer-Schonberger and Cukier put it, “N = all.”

Similarly, Flowers didn’t look at 1% or 2% of the roughly 4.4 million single-family and two-family homes[3] in NYC. He and his team looked at data on 900,000 units – a sample of roughly 20%.

This is a fundamental change in technique, made possible by high-capacity, inexpensive disk drives; high-speed, low-cost computer memory; and graphics processing units, GPUs, which are designed for the simultaneous, or parallel, execution of simple operations on large amounts of data.

As Viktor Mayer-Schonberger and Kenneth Cukier hammer home, science has been built on the search for causation, using statistical samples of events to understand how and why they happen. Statistics are built on representative samples. But things are different when your sample size is 100%. At that point you can look for correlations.

That’s what Cynthia Rudin and Mike Flowers did. They didn’t look for causation. They looked for correlation. And they found it. Netflix had a high degree of confidence that 11 million people would watch one or more episodes. Amazon and Apple realized that people who bought songs or CDs by certain artists would buy songs or stream music by other artists performing in the same genre. American Express looks for purchases that make sense and watches for transactions that are out of sync: you cannot use your credit card in stores in Houston or San Francisco when you are in L.A. Correlation, not causation.

Notes:

  1. Netflix Architecture. https://www.geeksforgeeks.org/system-design-netflix-a-complete-architecture/
  2. A “petabyte” is one thousand terabytes, one million gigabytes, one billion megabytes, or one trillion kilobytes. For reference, one page of single-spaced typed text may be four kilobytes, and an image taken with a two-to-four-megapixel digital camera, circa 2010, will be roughly one megabyte. A petabyte, therefore, could hold one billion such images or 250 billion pages of text.
  3. New York Housing Statistics, Info Please, viewed Sept 15, 2023. https://www.infoplease.com/us/census/new-york/housing-statistics

Big Data: AI Solutions to Intractable Problems (Part I)

AI has become an overnight sensation. But like other overnight sensations, it has taken years to get there – 73 years since 1950, when Alan Turing described the Imitation Game, better known as the Turing Test:

“A computer would be said to be intelligent if and when a human evaluator, after a natural language conversation with the computer, would not be able to tell whether he or she was talking to another person or a machine.”

Microsoft has invested billions of dollars in OpenAI, the maker of ChatGPT, a large language model (LLM). Microsoft is embedding ChatGPT into its search engine and building and releasing “Copilots” for Word, Excel, and other software. Google announced “Bard,” its own LLM. Abnormal Security, CrowdStrike, Egress, Riskified, and others are building AI into cybersecurity tools. Apple announced that it is embedding AI within the iPhone, iPad, and Mac. NVIDIA, which makes the chips and servers used in AI, was briefly the world’s largest company by market capitalization, leapfrogging Microsoft and Apple before falling back to a valuation of roughly $2.7 trillion.

Today the relevant question is not, “Are these systems ‘intelligent’?” For the C-suite, the question is, “What problems can we use AI to solve?” And for project managers, the question is, “How do we plan and execute projects to leverage AI?”

Just as project managers in information technology have to understand networks, virtualization, and “the Cloud” in order to manage projects implementing or leveraging those technologies, and need to understand Waterfall and Agile for software engineering and infrastructure projects, we need to understand AI and Big Data in order to build and incorporate AI tools and manage Big Data projects. We don’t need to know the technical differences between Central Processing Units, CPUs[1], like the Intel Xeon and AMD EPYC, and Graphics Processing Units, GPUs[2], like the Nvidia RTX. But we do need to know that GPUs are used in Big Data and machine learning systems because they are designed for parallel processing – the simultaneous execution of large numbers of simple tasks, such as rendering bitmaps or comparing data points, for example representations of faces in an image recognition system.
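
To make that contrast concrete, here is a small sketch of the data-parallel style GPUs are built for, using NumPy on a CPU as a stand-in: one face “signature” is compared against a large batch of stored signatures in a single vectorized operation rather than one at a time. The 128-number embeddings are an assumption for illustration, borrowed from a common convention in face recognition systems:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 100,000 stored face embeddings and one probe face,
# each represented as a vector of 128 numbers.
stored = rng.standard_normal((100_000, 128)).astype(np.float32)
probe = rng.standard_normal(128).astype(np.float32)

# One vectorized operation computes all 100,000 distances at once – the same
# "apply a simple calculation to many data points" pattern that a GPU would
# spread across thousands of cores.
distances = np.linalg.norm(stored - probe, axis=1)
best_match = int(np.argmin(distances))
print(best_match, float(distances[best_match]))
```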

The team members need to understand how to think analytically. They need to be creative, disciplined, and flexible enough to change their focus from causation to correlation, and to recognize patterns. It helps to have an understanding of statistics and the scientific method.

The simplest correlation works beautifully. Suppose you buy a ticket to fly from New York to LA. Two hours before the flight, you pay for a taxi and then, a few minutes later, baggage fees. Subsequently, you buy a book or magazine, a sandwich, and coffee in the airport. Hours later, after the flight, you pay for a taxi in LA and check into your hotel. The credit card company has a high degree of confidence that these transactions are legitimate. It knows you were traveling from New York to LA, knows you buy stuff in airports and use taxis to travel to and from airports. But if during the flight, or an hour after the plane lands in California, your card is used to attempt to buy something else on the east coast, or in any location far from LA, the credit card company will have a high degree of confidence that those charges are fraudulent. It will deny them or contact you.
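
A production fraud model weighs thousands of signals, but the geographic piece of the reasoning above can be sketched as a simple plausibility rule: could the cardholder physically have traveled between two consecutive charges? The threshold, coordinates, and function names below are our own illustration, not any card network’s actual logic:

```python
from dataclasses import dataclass
from math import radians, sin, cos, asin, sqrt

@dataclass
class Charge:
    lat: float
    lon: float
    timestamp_hours: float  # hours since some reference point

def distance_miles(a: Charge, b: Charge) -> float:
    """Great-circle distance between two charges (haversine formula)."""
    dlat, dlon = radians(b.lat - a.lat), radians(b.lon - a.lon)
    h = sin(dlat / 2) ** 2 + cos(radians(a.lat)) * cos(radians(b.lat)) * sin(dlon / 2) ** 2
    return 3959 * 2 * asin(sqrt(h))

def looks_fraudulent(prev: Charge, curr: Charge, max_speed_mph: float = 600) -> bool:
    """Flag a charge if reaching it from the previous one would require impossible travel."""
    hours = max(curr.timestamp_hours - prev.timestamp_hours, 0.01)
    return distance_miles(prev, curr) / hours > max_speed_mph

# Taxi and baggage fees at JFK, then a charge near LAX six hours later: plausible.
# A charge back in Manhattan one hour after that: flagged.
jfk = Charge(40.64, -73.78, 0.0)
lax = Charge(33.94, -118.41, 6.0)
nyc_again = Charge(40.75, -73.99, 7.0)
print(looks_fraudulent(jfk, lax))        # False - consistent with a flight
print(looks_fraudulent(lax, nyc_again))  # True - physically impossible travel
```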

American Express customers currently use their 114 million cards to purchase $1.2 Trillion of goods and services annually in 105 countries. That’s a tremendous amount of money, a staggering number of transactions, and an unbelievable number of opportunities for fraud. Amex was an early adopter of Big Data. In 2010, they began consolidating the knowledge gained in 150 different fraud detection models into one global model. They rolled this out in 2014 and made it their only model in 2015. Today, American Express’ fraud detection model delivers the best fraud detection and prevention statistics in the industry.

According to Anjali Dewan, VP of Risk Management at Amex:

“100% of our [fraud detection] models are AI-powered…. American Express for 13 years has come out as the lowest in the fraud space, and not by a little, but by half.”[3]

Put another way, American Express has an army of virtual fraud investigators working around the globe and around the clock.

In “Big Data,”[4] Viktor Mayer-Schonberger and Kenneth Cukier explain how Consolidated Edison used Big Data to identify manholes likely to catch fire and explode, sending 300-pound manhole covers into the sky, and how the City of New York wrestled with the fact that firefighters are 15 times more likely to die fighting fires in illegal residential conversions. The correlation was obvious: prevent fires in illegal conversions, save firefighters’ lives. In “Big Data @ Work,”[5] Thomas Davenport describes how Netflix used Big Data to know that “House of Cards” would be successful.

Davenport, Mayer-Schonberger, and Cukier describe the characteristics of “data scientists” – scientists who can code, hackers who understand the scientific method, people who know how to develop and test a hypothesis, understand statistics, think independently, challenge assumptions, and find correlations, some of which may be non-obvious.

They distinguish “Big Data” from data warehouses and other decision support systems. It’s not simply the volume of data; more important is the fact that data warehouses are internally derived, highly structured databases. Where a data warehouse may be hundreds of gigabytes or terabytes managed in a relational database, a Big Data system may look at wildly unstructured data, multiply structured data, or complex data, some of which may come from external sources.

For example, telephone call records for a wireless carrier, transactions in a credit card processing system, or multiple years of accounting data would be contained in internal databases and managed by a relational database engine, an RDBMS. All the records are similar and highly structured: tuples[6] of data with attributes like name, account number, phone number, address, and so on. SQL, the Structured Query Language, built on the relational model Edgar F. Codd described at IBM in 1970[7], is used to query these databases, whether they are managed by IBM DB2, Microsoft SQL Server, Oracle, or another database engine. These are internally managed, proprietary databases.

Big Data, on the other hand, is unstructured or partially structured and may include data from outside the enterprise. Medical data, for example, will contain patient records, which can be managed within relational models, but will also contain diagnostic information in the form of images, videos, hand-written notes, and the results of various tests – and these data may come from many different sources.

Notes:

  1. CPUs are designed with a small number of cores, each with a relatively large amount of cache RAM, in order to run multiple applications, each with varying amounts of data, such as executing applications on a workstation or operating systems on a virtualization host.
  2. GPUs are designed with a large number of cores, each with a relatively small amount of cache RAM, in order to simultaneously and repetitively execute the same process on a large number of data points.
  3. Machine Learning Helps Payment Services Detect Fraud, AmericanExpress.com. How Amex Uses AI To Automate 8 Billion Risk Decisions (And Achieve 50% Less Fraud), John Koetsier, Forbes.com, Sept 21, 2020.
  4. Mayer-Schonberger, Viktor, and Kenneth Cukier, Big Data, © 2013
  5. Davenport, Thomas, Big Data @ Work, © 2014, Harvard Business School Publishing Corp.
  6. In database theory, a tuple is a single row in a table or index of a relational database. The term comes from mathematics, where a tuple is an ordered list of mathematical objects, e.g., integers in a relation, such as coordinates in a Cartesian plane.
  7. IBM Archives, “Edgar F. Codd,” https://www.ibm.com/ibm/history/exhibits/builders/builders_codd.html