Problems with Indexing Datasets like Web Pages

Google has created a dataset search for researchers or the average person looking for datasets. On the one hand, this is a cool idea. Datasets are hard to find in some cases, and this ostensibly makes the datasets and accompanying research easier to find.
On the other hand, in my opinion this dataset search is problematic for two main reasons.

1. Positioning Google as a one-stop-shop for research is risky.

There’s consistent evidence that many people (especially college students who don’t work with their library) start and end their research with Google rather than using scholarly databases, which limits the potential quality of their research. (There’s also something to be said here about quality research being locked behind exploitative and exclusionary paywalls, but that’s for another discussion.)
Google’s business goal of being the first and last stop for information hunts makes sense for them as a company. But such a goal doesn’t necessarily improve academic research, or the knowledge that people derive based on information returned from search results.

2. Datasets without datasheets easily lead to bias.

The dataset search is clearly focused on indexing as many datasets as possible and making them more available. The cost of that is continued sloppy data analysis and research, due to the lack of standardized documentation (Datasheets for Datasets, for example) that fully exposes the contents and limitations of datasets.
The existing information about these datasets is constructed based on the schema defined by the dataset author, or perhaps more specifically, the site hosting the dataset. It’s encouraging that datasets have dates associated with them, but I’m curious where the descriptions for the datasets are coming from.
Only the description and the name fields for the dataset are required before a dataset appears in the search. As such, the dataset search has limitations. Is the description for a given dataset any higher quality than the Knowledge Panels that show up in some Google search results? How can we as users independently validate the accuracy of the dataset schema information?
The quality and level of detail in the description field vary widely across datasets (I did a cursory scan of datasets resulting from a keyword search for “cheese”), indicating that a required plain-text field doesn’t do much to ensure high-quality, useful information.
When datasets are easier to find, that can lead to better data insights for data analysts. However, it can just as easily lead to off-base analyses if someone misuses data that they found based on a keyword search, either intentionally or, more likely, because they don’t fully understand the limitations of a dataset.
Some vital limitations to understand when selecting a dataset for use in data analysis include:
  • What does the data cover?
  • Who collected the data?
  • For what purpose was the data collected?
  • What features exist in the data?
  • Which fields were collected and which were derived?
  • If fields were derived, how were they derived?
  • What assumptions were made when collecting the data?
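As a rough illustration, the questions above could be checked against a dataset's metadata record. The sketch below is only that: the name and description fields are the two required fields mentioned earlier, while every other field name (coverage, collector, purpose, and so on) is a hypothetical label invented here to stand in for one question, not part of any published schema, and the sample record is made up.

```python
# The two fields described above as required before a dataset appears in search.
REQUIRED_FIELDS = ("name", "description")

# Hypothetical field names, one per question in the list above.
# These are illustrative assumptions, not fields from a real schema.
DATASHEET_FIELDS = {
    "coverage": "What does the data cover?",
    "collector": "Who collected the data?",
    "purpose": "For what purpose was the data collected?",
    "features": "What features exist in the data?",
    "derived_fields": "Which fields were collected and which were derived?",
    "derivation_method": "If fields were derived, how were they derived?",
    "assumptions": "What assumptions were made when collecting the data?",
}


def is_blank(value):
    """Treat missing values, whitespace-only strings, and empty containers as blank."""
    if value is None:
        return True
    if isinstance(value, str):
        return not value.strip()
    if isinstance(value, (list, tuple, set, dict)):
        return len(value) == 0
    return False


def audit_metadata(record):
    """Report which required fields and which datasheet questions are unanswered."""
    missing_required = [f for f in REQUIRED_FIELDS if is_blank(record.get(f))]
    unanswered = [
        question
        for field, question in DATASHEET_FIELDS.items()
        if is_blank(record.get(field))
    ]
    return {
        "meets_minimum": not missing_required,
        "missing_required": missing_required,
        "unanswered_questions": unanswered,
    }


if __name__ == "__main__":
    # A made-up record that satisfies the minimum but answers none of the questions.
    example = {"name": "Cheese production by region", "description": "Cheese data."}
    report = audit_metadata(example)
    print("Meets minimum:", report["meets_minimum"])
    for question in report["unanswered_questions"]:
        print("Unanswered:", question)
```

A record like the example passes the minimum bar while leaving every question unanswered, which is the gap described here. A check like this only shows whether a field is filled in, not whether its contents are accurate.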

Without these valuable limitations being made as visible as the datasets themselves, I struggle to feel overly encouraged by this dataset search in its current form.

Ultimately, making information more easily accessible while removing or obscuring indicators that can help researchers assess the quality of the information is risky and creates new burdens for researchers.

Power to the Microbes!

Modern Farmer has an article about the “next green revolution”:

“A lot of materials used in corporate agriculture have the capacity to enhance plant growth and performance, but they suppress soil biology,” he says.

The scorched-earth tactics he’d employed with his pesticides and herbicides, he realized, had worked all too well. The microbial life critical to healthy soils had become collateral damage. Afterwards, in a best-case scenario, Kempf could coax his cantaloupes and other crops to acceptable yields only by practically drowning them in fertilizer. He threw this approach out the window. Instead, by focusing on creating healthy soils, he’d let plants do what plants have evolved to do best when they’re given a fighting chance: grow like crazy.

By focusing on the base biology of his farm–the soil–this farmer, John Kempf, could increase yields and the health of the crops. Essentially, with healthy soil, you can grow hardier, stronger plants. As he determined, wholesale applications of pesticides would lead to temporary successes against pests that attack his plants, and copious amounts of fertilizer could coax the plants into growing larger and better, but once he turned his attention to the biology of the soil that his plants were growing in, the plants grew better than ever before.

Essentially, with healthy microbial life in soil, plants can grow bigger and stronger.

A class of bacteria commonly found in the guts of people—and rodents—appears to keep mice safe from food allergies, a study suggests. The same bacteria are among those reduced by antibiotic use in early childhood. The research fits neatly into an emerging paradigm that helps explain a recent alarming increase in food allergies and other conditions, such as obesity and autoimmune disease, and hints at strategies to reverse the trend.

By focusing on the base biology of humans–the gut–this scientist, Cathryn Nagler, and her research team have found a way to prevent food allergies from developing. With a healthy gut microbiome, your body is better equipped to stay healthy and fight off diseases. While antibiotics kill bacteria that cause diseases, they also kill good bacteria that keep us healthy. This research is another step toward recognizing that people can better control and stabilize their health by cultivating a healthier gut.

Healthy soil leads to strong, healthy plants. Healthy gut leads to strong, healthy humans. Power to the microbes! 

Hobby Lobby, Facebook, and SPORTS

This week’s super important great big news:

The Supreme Court ruled that Hobby Lobby and other private, closely held companies can use religious belief as a reason to deny coverage of certain contraceptives for employees.

Here is the decision described in plain English by SCOTUSBlog: “The families that own Hobby Lobby and Conestoga Wood Specialties are deeply religious and do not want to make four of those twenty kinds of birth control – IUDs and the “morning after” pill – available to their female employees because they believe that it would make them complicit in abortion. Today the Court agreed that they don’t have to.”

Here are some quotes from Ruth Bader Ginsburg’s dissent, which starts on page 60 of the PDF of the Supreme Court decision linked above.

For a personal reaction, The Hairpin has republished a great personal essay/history about the importance of bodily autonomy for women.

The New Yorker calls on history to identify the Hobby Lobby case (and the Harris case, about required union contributions) as the latest representation of a trend the Supreme Court has been following for years:
“in confronting a politically charged issue, the court first decides a case in a “narrow” way, but then uses that decision as a precedent to move in a more dramatic, conservative direction in a subsequent case.”

An additional New Yorker article does a great job of addressing these potential threats in further depth:
“Women’s health is treated as something troublesome—less like other kinds of health care, which a company should be asked to pay for, than as a burden for those who have to contemplate it. That is bad enough. But the Hobby Lobby decision is even worse.”

Women, the Web, and the App Takeover

Here’s what was important this week…

Today is Pi Day. Here is more than you probably ever wanted to know about Pi Day.

Last Saturday, March 8, was International Women’s Day. Started as a revolutionary holiday to honor the achievements of women, International Women’s Day is recognized in many countries. However, in Nepal it is observed by women only, rather than as a day when men pay tribute to women. Nepal also has another holiday that only women observe:

“In early September in Nepal, Hindus – who make up 81 per cent of the country’s 30.5 million people – celebrate Rishi Panchami, a festival that commemorates a woman who was reborn as a prostitute because she didn’t follow menstrual restrictions. It is a women’s holiday, and so Nepal’s government gives all women a day off work. This is not to recognise the work done by women, but to give them the time to perform rituals that will atone for any sins they may have committed while menstruating in the previous year. (Girls who have not begun menstruating and women who have ceased to menstruate are exempt.)”

The interesting thing, however, is that despite the cultural distaste for menstruation and the monthly banishment that surrounds it, “they talk openly – more openly perhaps than the average teenage girl in the UK might – about what they use for sanitary protection. Some use sanitary pads, some are happy with cloths, although they dry them by hiding them under other clothes on washing lines.”

