Data Quality as a Dimension of Trusted AI

What Kinds of Data Sources Should I Use in Machine Learning?

Data provenance and integrity are essential elements of understanding your model. Depending on your use case, you might end up with a mix of the following:

  • Internal data, private to your enterprise
  • Open-source public data
  • Third-party data

It is very important to maintain a traceable record of the provenance of your data: what data sources were used, when they were accessed, and how they were verified. 
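
Keeping such a record need not require heavy tooling. Below is a minimal sketch in Python; the DatasetProvenance class and every field and value in it are hypothetical illustrations, not a standard schema.

    from dataclasses import dataclass
    from datetime import date

    @dataclass
    class DatasetProvenance:
        """Hypothetical record of where a dataset came from and how it was vetted."""
        name: str              # dataset identifier
        source: str            # e.g., an internal system, a public repository, a vendor
        origin_type: str       # "internal", "open-source", or "third-party"
        accessed_on: date      # when the data was pulled
        collection_notes: str  # how and when the data was originally collected
        verification: str      # how integrity was checked (checksums, row counts, ...)

    # Example entry for a hypothetical third-party dataset.
    record = DatasetProvenance(
        name="credit_bureau_scores_2023",
        source="Acme Data Vendor",
        origin_type="third-party",
        accessed_on=date(2023, 6, 1),
        collection_notes="Monthly snapshots, 2019-2023; sampling method unknown.",
        verification="Checksums matched the vendor manifest; row counts verified.",
    )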

When using third-party or open-source data, you should gather as much information as possible on how the data was collected, including its recency. This can inform whether the data is ultimately relevant to your use case. All these points also apply to your internal data, where you are likely to have more transparency and might have an easier time identifying whether any sampling bias was introduced during data collection.

How Do You Assess the Quality of Your Data?

The next step, before any modeling takes place, is exploratory data analysis, including data quality assessments. It is important to understand your data and explore for yourself what kinds of relationships may be present. This includes the basics, sketched in code after this list:

  • Computing summary statistics on each of your features
  • Measuring associations between features
  • Observing feature distributions and their correlation with the predictive target
  • Identifying any outliers
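
As a rough sketch of these basics with pandas (the file name training_data.csv and the target column name "target" are placeholders, and the target is assumed to be numeric):

    import pandas as pd

    # Assume df is your training DataFrame and "target" is the prediction target.
    df = pd.read_csv("training_data.csv")  # hypothetical file name

    # Summary statistics for each feature (count, mean, std, quartiles, ...).
    print(df.describe(include="all"))

    # Pairwise associations between numeric features.
    corr = df.corr(numeric_only=True)
    print(corr)

    # Correlation of each numeric feature with the target.
    print(corr["target"].drop("target").sort_values(ascending=False))

    # Simple outlier flagging via the 1.5 * IQR rule, per numeric column.
    numeric = df.select_dtypes("number")
    q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
    iqr = q3 - q1
    outliers = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
    print(outliers.sum())  # number of flagged values per column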

Other aspects of data quality assessment include missing or disguised missing values, duplicated rows, unique identifiers that carry no information for the model, and target leakage detection. Before deciding how to deal with missing data (for instance, whether certain rows or features should be excluded from modeling, and/or whether a value should be imputed), you must understand whether there were any systematic behaviors correlated with the missing values. For example, did specific locations of your retail chain habitually fail to report certain information? If so, you might be able to request additional data and fill those gaps proactively.
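
A sketch of these checks in pandas follows; the sentinel list, the file name, and the store_location/weekly_sales column names are illustrative assumptions, not fixed conventions:

    import pandas as pd

    df = pd.read_csv("training_data.csv")  # hypothetical file name

    # Plainly missing values per column.
    print(df.isna().sum())

    # Disguised missing values: common sentinels that should be treated as NaN.
    SENTINELS = ["?", "N/A", "none", "-999", ""]  # illustrative; adjust per dataset
    disguised = df.isin(SENTINELS).sum()
    print(disguised[disguised > 0])

    # Fully duplicated rows.
    print(df.duplicated().sum())

    # Columns that are unique per row (likely identifiers, uninformative to a model).
    id_like = [c for c in df.columns if df[c].nunique() == len(df)]
    print(id_like)

    # Is missingness systematic? E.g., the missing rate of a field by location
    # ("store_location" and "weekly_sales" are hypothetical column names).
    print(df["weekly_sales"].isna().groupby(df["store_location"]).mean())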

Target leakage is a data quality issue unique to machine learning. It refers to the exposure of information to the model that it would not have at the time of prediction. This enables your model to “cheat” and seemingly perform better than it possibly could in production. For example, historical records such as the levying of late fees on a loan would not be available at the time of a loan application, when you are trying to predict whether the applicant will make their payments on time. You should regard a feature with a high univariate correlation with the target with suspicion, though subject matter expertise is essential for identifying subtler kinds of target leakage, which might be masked in a composite feature.
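
One cheap screen is to flag features whose univariate correlation with the target is implausibly high and review them manually. In the sketch below, the 0.95 threshold is an arbitrary illustration to be tuned with subject matter experts, and the file and column names are placeholders:

    import pandas as pd

    df = pd.read_csv("training_data.csv")  # hypothetical file name

    # Absolute correlation of each numeric feature with the target.
    corr_with_target = df.corr(numeric_only=True)["target"].drop("target").abs()

    # Flag suspiciously strong univariate relationships for manual review.
    SUSPICION_THRESHOLD = 0.95  # arbitrary; tune with subject matter experts
    suspects = corr_with_target[corr_with_target > SUSPICION_THRESHOLD]
    print(suspects.sort_values(ascending=False))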

How Do You Improve Your Training Data?

Data cleaning includes the handling of missing values and of outliers or inliers (anomalous values that fall within the bulk of the distribution), along with the dropping of duplicate rows or columns. You might remove whole rows or features from the data because of the prevalence of missing values or outliers. Imputation can also be an appropriate way to handle missing values, though the right type of imputation depends on the modeling approach used.
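
As a sketch of these cleaning steps, pandas handles the dropping and scikit-learn's SimpleImputer handles the imputation below; the 50% missingness cutoff and the file name are placeholder assumptions:

    import pandas as pd
    from sklearn.impute import SimpleImputer

    df = pd.read_csv("training_data.csv")  # hypothetical file name

    # Drop exact duplicate rows and duplicate columns.
    df = df.drop_duplicates()
    df = df.loc[:, ~df.T.duplicated()]

    # Drop features that are mostly missing (50% is an arbitrary cutoff).
    df = df.loc[:, df.isna().mean() < 0.5]

    # Median-impute remaining numeric gaps; the median is robust to outliers.
    # Other modeling approaches may warrant different strategies, such as
    # adding a missingness-indicator column alongside the imputed values.
    numeric_cols = df.select_dtypes("number").columns
    imputer = SimpleImputer(strategy="median")
    df[numeric_cols] = imputer.fit_transform(df[numeric_cols])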
