Data is the bedrock of AI and machine learning — so it only makes sense that at Transform 2020 we dedicated time to look under the hood and query some leading data experts about the trends they’re seeing in how companies are identifying the right data.
To that end, Dataiku VP of field engineering Jed Dougherty led a panel rounded out by Jaime DeLanghe, director of product at Slack; Hui Wang, VP of data science at PayPal; Dimitris Tsementsiz, senior quantitative researcher at Goldman Sachs; and Jacob Wilson, principal at PwC.
Dougherty started by asking each panelist to talk about a data problem their internal teams have had to deal with, or new ways they’re developing to clean or manipulate data.
“We talk a lot about new predictive algorithms in the AI community,” Dougherty said, “but what other new techniques should data teams have in their quiver when they’re attempting to address issues with their data? For example, identifying and correcting cyclical trends in the data and ensuring the data is labeled accurately.”
DeLanghe began by explaining that a key challenge at Slack is that so much of its data is unlabeled, so her team is primarily sifting through behavioral data in the context of search. That makes it complicated to understand what a user’s action actually means, and her team is focused on combating that ambiguity. “You take a whole slew of signals and then you try to predict based on not just text features, which are sort of traditional, but also behavioral features or other attributes of the message,” she said. “Maybe it has highlight characters. Is there an emoji?”
They’ve now started marrying click data with survey data in order to debug assumptions they may have in their models. “We’re going to look at how long did [a user] stay on the page, or maybe just hovered over it,” DeLanghe said. “We’re just asking people, as a starting place, to try to correlate those sort of second-order activities that will pull out both success cases and failure cases.” Or, as Dougherty put it, “So if you’re thinking about this from a learning perspective, you’re getting users to label themselves.”
At Goldman Sachs, Tsementsiz explained, the data is very often nonstationary: its statistical properties shift over time, so you can’t use all the data that’s available to you for a particular predictive task. Overfitting is also a danger, since a model tied too closely to a limited set of data points will make errors on new data.
“Suppose you’re interested in the volume traded for a financial asset or a stock, and you use the volume of the traded asset to predict something,” he said. “Now, if you take a daily average of the volume for a stock, you look at what happened yesterday. You take some average over the entire day and that is information that you only knew for certain at the end of yesterday, but this information is not actually going to be available to you today.”
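The timing problem Tsementsiz describes is often called lookahead bias: a full-day average is only known once that day has closed, so it can only serve as an input for a later day. A minimal Python sketch of that alignment follows. The function name, dates, and volume figures are hypothetical and chosen purely for illustration; this is not Goldman Sachs’ method.

```python
from statistics import mean


def lagged_daily_average(volumes_by_day):
    """Map each day to the PREVIOUS day's average volume.

    A day's own average is only known after that day closes, so the
    feature available for a given day is the prior day's average.
    The first day has no prior day and maps to None.
    """
    features = {}
    previous_average = None
    # ISO-formatted date strings sort chronologically.
    for day in sorted(volumes_by_day):
        features[day] = previous_average
        previous_average = mean(volumes_by_day[day])
    return features


# Hypothetical intraday volume readings, keyed by trading day.
intraday_volumes = {
    "2020-07-13": [100, 300, 200],
    "2020-07-14": [400, 600],
    "2020-07-15": [50, 150],
}

features = lagged_daily_average(intraday_volumes)
for day, value in features.items():
    print(day, value)
```

Pairing each day with its own average instead would hand the model information it could not have had at prediction time.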
PayPal’s Wang spoke about identifying the right data in order to strengthen fraud detection. A typical fraud detection system will decline a payment if, for example, the account holder lives in New York and the purchase comes from an IP address in Thailand. She explained how PayPal is using more data points to reduce both false positives and false negatives.
“What are the possible stories behind this behavior?” she said. “For example, I might be traveling in Thailand — say, staying in a hotel — and we can use our AI technology to create intelligence based on that IP. This IP is from Thailand, but it’s actually a resort, so it’s possible this person is traveling or is connected to some kind of a global company VPN.”
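The contrast between a location-only rule and one that considers context about the IP address can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `Payment` fields, the IP category labels, and the decision values are hypothetical, and PayPal’s actual system is not described in this level of detail.

```python
from dataclasses import dataclass


@dataclass
class Payment:
    home_country: str  # where the account holder lives
    ip_country: str    # where the request's IP address is located
    ip_category: str   # hypothetical label, e.g. "residential", "hotel", "corporate_vpn"


# Hypothetical IP categories that offer a benign explanation for a mismatch.
BENIGN_IP_CATEGORIES = {"hotel", "corporate_vpn"}


def naive_decision(payment):
    """Decline any payment whose IP country differs from the home country."""
    if payment.ip_country != payment.home_country:
        return "decline"
    return "approve"


def context_aware_decision(payment):
    """Approve a mismatch when the IP context suggests travel or a company VPN."""
    if payment.ip_country == payment.home_country:
        return "approve"
    if payment.ip_category in BENIGN_IP_CATEGORIES:
        return "approve"
    # No benign explanation found: flag for further checks rather than decline outright.
    return "review"


traveler = Payment(home_country="US", ip_country="TH", ip_category="hotel")
print(naive_decision(traveler))          # decline
print(context_aware_decision(traveler))  # approve
```

The location-only rule turns the traveler into a false positive, while the extra data point about the IP gives the system a plausible story to weigh.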
For PwC, most of its data is in document form: hundreds of millions of documents the company gathers each year, such as tax forms, lease agreements, purchase agreements, mortgage contracts, and syndicated loans, among others. Data extraction from documents like these requires extreme sensitivity to privacy and security concerns.
“Consistent, continuous learning pipelines has been key to help improve the information extraction models over time, and how we’re securing the data,” Wilson explained. “In certain cases, where we are trying to improve our models, by nature of the data we’re actually allowed to use, we might have to resort to one-shot learning where we have limited data in an area.”