Problems with Indexing Datasets like Web Pages

Google has created a dataset search for researchers and anyone else looking for datasets. On the one hand, this is a cool idea. Datasets can be hard to find, and this ostensibly makes datasets and their accompanying research easier to discover.

In my opinion, this dataset search is problematic for two main reasons.

1. Positioning Google as a one-stop-shop for research is risky.

There’s consistent evidence that many people (especially college students who don’t work with their library) start and end their research with Google rather than using scholarly databases, limiting the potential quality of their research. (There’s also something to be said here about how access to quality research is limited by exploitative and exclusionary paywalls, but that’s for another discussion.)

Google’s business goal of being the first and last stop for information hunts makes sense for them as a company. But such a goal doesn’t necessarily improve academic research, or the knowledge that people derive based on information returned from search results.

2. Datasets without datasheets easily lead to bias.

The dataset search is clearly focused on indexing as many datasets as possible and making them more widely available. The cost of that focus is continued sloppy data analysis and research, because there’s no standardized documentation, such as Datasheets for Datasets, that fully exposes the contents and limitations of each dataset.
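To make that concrete, a datasheet asks the dataset creator to answer a structured set of questions about how the data came to be and how it should (and shouldn’t) be used. The sketch below is a rough Python approximation; the field names are my own shorthand for the categories proposed in Datasheets for Datasets, not an official schema, and the example values are invented.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Datasheet:
    """Rough sketch of the questions a datasheet answers; not an official schema."""
    motivation: str          # why the dataset was created, and by whom
    composition: str         # what the records represent, and what is missing
    collection_process: str  # how, when, and from whom the data was gathered
    preprocessing: str       # what cleaning, labeling, or filtering was applied
    recommended_uses: str    # what analyses the data is suited (and unsuited) for
    known_limitations: List[str] = field(default_factory=list)

# A hypothetical entry, for illustration only.
cheese_prices = Datasheet(
    motivation="Track retail cheese prices for a regional market report.",
    composition="Weekly price observations from a handful of large grocery chains.",
    collection_process="Scraped from store websites; independent shops are not included.",
    preprocessing="Prices averaged per week; promotional discounts excluded.",
    recommended_uses="Price trends for the covered chains, not national price levels.",
    known_limitations=["No small retailers", "Single country", "Short time span"],
)
```

Even a lightweight record like this surfaces the kind of context that a name and a free-text description alone cannot.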

The existing information about these datasets is constructed based on the schema defined by the dataset author, or perhaps more specifically, by the site hosting the dataset. It’s encouraging that datasets have dates associated with them, but I’m curious where the descriptions for the datasets are coming from.

Only the description and the name fields for a dataset are required before it appears in the search. As such, the metadata the search surfaces is only as good as what publishers choose to provide. Is the description for a given dataset any higher quality than the Knowledge Panels that show up in some Google search results? How can we as users independently validate the accuracy of the dataset schema information?
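For context, Dataset Search discovers datasets through structured data that publishers embed in their pages, typically schema.org/Dataset markup in JSON-LD, and Google’s documentation lists name and description as the only required properties. The snippet below is a minimal sketch with made-up values, plus a trivial completeness check; it isn’t how Google actually validates submissions, but it shows how low the bar for “indexed” can be.

```python
import json

# A minimal schema.org/Dataset record in JSON-LD, with made-up values.
# Optional properties (dates, license, creator, measured variables) are omitted.
dataset_markup = json.loads("""
{
  "@context": "https://schema.org",
  "@type": "Dataset",
  "name": "Regional cheese prices",
  "description": "Weekly retail cheese prices collected from a few grocery chains."
}
""")

REQUIRED_FIELDS = ("name", "description")

def meets_minimum(markup: dict) -> bool:
    """Check only that the required fields exist and are non-empty strings.
    Says nothing about whether the description is accurate or useful."""
    return all(
        isinstance(markup.get(f), str) and markup[f].strip()
        for f in REQUIRED_FIELDS
    )

print(meets_minimum(dataset_markup))  # True, even though the record tells us very little
```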

The quality of, and detail provided in, the description field vary widely across datasets (I did a cursory scan of the results of a keyword search for “cheese”), which indicates that a required plain-text field doesn’t do much to assure quality, valuable information.

When datasets are easier to find, data analysts can arrive at better insights. However, easier discovery can just as readily lead to off-base analyses if someone misuses data they found through a keyword search, either intentionally or, more likely, because they don’t fully understand the limitations of the dataset.

Some vital limitations to understand when selecting a dataset for use in data analysis include how the data was collected, what it does and does not contain, and what uses it was originally intended for.

Without this information about a dataset’s limitations being made as visible as the dataset itself, I struggle to feel overly encouraged by this dataset search in its current form.

Ultimately, making information more easily accessible while removing or obscuring indicators that can help researchers assess the quality of the information is risky and creates new burdens for researchers.