What are the best practices for data curation in GitHub? « Digital Humanities Questions & Answers

Asked 2 years ago by lmullen
Latest answer from Ben Brumfield

lmullen
Member
I'm beginning a project that will gather data on the demography of American religion. I'm going to keep the transcribed data in CSV files and manage them in GitHub.

These are the principles that I'm thinking about following.

1. Keep the data in a repository separate from the code for visualizations. This way, the data will be useful to people apart from the specific visualizations. The data can then be included in other projects as a Git submodule.

2. Organize the raw, transcribed data by sources. I could guess what use the sources might have and organize them by denomination, for example. But it seems better not to have to make those judgment calls at the beginning, and just organize the data by source. That also makes it easier to manage citations.

(On this point, someday I might write a plugin for Omeka that will let an item, say a historical source with an attached PDF, also have a link to a GitHub file. That link would be exposed through the Omeka API, so that someone could get the raw file from GitHub through the Omeka site. Thus, GitHub could function as a back end for Omeka. But that's just an idea; for now it's more important to me to start gathering the data in useful ways.)

3. Include citation and explanatory data in separate files. It's possible to embed comments in CSV files, but there doesn't seem to be a standard. And when languages like R read CSVs, you have to specify what the comment character is or you get gibberish. It seems better for a file like lastname-1865.csv to have a corresponding file named lastname-1865.txt, which would contain the citation and any necessary explanation of the fields in the CSV.
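
The pairing convention is easy to check mechanically. Here is a minimal sketch in Python, assuming the CSVs and their notes sit together in one directory; the function name `missing_sidecars` and that layout are illustrative assumptions, not part of any existing tool.

```python
from pathlib import Path


def missing_sidecars(data_dir):
    """Return the names of CSV files in data_dir that have no same-named .txt file."""
    data_dir = Path(data_dir)
    return sorted(
        csv_path.name
        for csv_path in data_dir.glob("*.csv")
        # lastname-1865.csv is expected to sit next to lastname-1865.txt
        if not csv_path.with_suffix(".txt").exists()
    )
```

Run before each commit, a check like this would flag any transcribed source that was added without its citation file.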

I also like Wayne Graham's suggestion on another thread to use Zotero ids. I'd have to decide whether GitHub + Omeka + Zotero is too many moving parts. And in any case, I'd want users to get everything they need from a git clone.

4. Whenever possible, try to make the explanations join-able. It's a lot of work when CSVs have fields named AHL0014 and you have to rename them without a simple join or merge.
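
One way to make the explanations join-able is to keep them in a small codebook CSV that maps each coded field name to a readable label. The sketch below assumes a two-column codebook with headers `code` and `label`; those headers, the second code `AHL0015`, the labels, and the sample values are all made up for illustration.

```python
import csv
import io


def load_codebook(codebook_file):
    """Read a two-column codebook (code,label) into a dict."""
    return {row["code"]: row["label"] for row in csv.DictReader(codebook_file)}


def rename_columns(data_file, codebook):
    """Yield each data row keyed by readable labels; unknown codes are kept as-is."""
    for row in csv.DictReader(data_file):
        yield {codebook.get(key, key): value for key, value in row.items()}


# Hypothetical sample data standing in for real files in the repository.
codebook_csv = io.StringIO("code,label\nAHL0014,members\nAHL0015,churches\n")
data_csv = io.StringIO("county,AHL0014,AHL0015\nSuffolk,1200,14\n")

rows = list(rename_columns(data_csv, load_codebook(codebook_csv)))
print(rows)  # [{'county': 'Suffolk', 'members': '1200', 'churches': '14'}]
```

Because the codebook is itself a CSV, the same rename could be done with a merge in R or any other tool instead of by hand.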

Do these make sense? Are there better ways or other considerations?
Posted 2 years ago
lmullen
Member

I thought of one more principle:

5. If it is necessary to combine or transform the data (e.g., into tidy data), then the transformation should be scripted, and all the scripts should be run from a make/rake file. But the transformed data should be committed in Git, so that users can get to the data without having to run the transformations themselves.
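
As a rough illustration of what one such scripted transformation might look like, here is a sketch in Python that reshapes a wide table (one column per year) into tidy rows (one observation per row). The column names and numbers are invented for the example, and `to_tidy` is a hypothetical helper, not an existing library function.

```python
import csv
import io


def to_tidy(wide_file, id_column, variable_name="variable", value_name="value"):
    """Turn every non-id column of a wide table into its own (id, variable, value) row."""
    tidy = []
    for row in csv.DictReader(wide_file):
        for column, value in row.items():
            if column != id_column:
                tidy.append(
                    {id_column: row[id_column], variable_name: column, value_name: value}
                )
    return tidy


# Made-up sample standing in for a transcribed source file.
wide = io.StringIO("denomination,1850,1860\nBaptist,10,12\nMethodist,20,25\n")
tidy_rows = to_tidy(wide, "denomination", variable_name="year", value_name="count")

# Write the tidy table as CSV text; a real script would write to a file to commit.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["denomination", "year", "count"])
writer.writeheader()
writer.writerows(tidy_rows)
print(out.getvalue())
```

A make or rake target would then just call scripts like this one, and the CSV they produce would be committed alongside the raw data.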

Posted 2 years ago
Ben Brumfield
Member
Andrew Torget and I took the approach you describe in #1 and #2 in the Digital Austin Papers, creating an organization for the overall project and separate repositories for the data (XML transcripts) and the presentation code (PHP/MySQL/JavaScript).

In our case this was motivated by the desire to decouple data from presentation because 1) we might rewrite the digital edition software later on a totally different system, and 2) we felt that data re-use and software re-use would each be hindered by coupling the data to the platform.

Our project was composed of hand-transcribed XML files that we had to transform into TEI-P5-compliant XML. As a result, we structured the data repository as follows:
1. source_xml: The hand-coded transcripts (source files we wanted to preserve history for).
2. teip5_xml: Programmatically generated TEI-P5 XML (which we envision scholars re-using).
3. scripts: The Perl and Ruby scripts used to transform the source files into the TEI-P5 files and to validate them.
4. reference: The TEI-P5 DTD that the validation scripts check against.
I'm pretty pleased with the results, but would welcome other suggestions. I look forward to seeing what you come up with.
Posted 2 years ago