I'm beginning a project that will gather data on the demography of American religion. I'm going to keep the transcribed data in CSV files and manage them in GitHub.
These are the principles that I'm thinking about following.
1. Keep the data in a repository separate from the code for visualizations. This way, the data will be useful to people apart from the specific visualizations. And then the data can be included in various projects as a submodule in Git.
2. Organize the raw, transcribed data by sources. I could guess what use the sources might have and organize them by denomination, for example. But it seems better not to have to make those judgment calls at the beginning, and just organize the data by source. That also makes it easier to manage citations.
(On this point, someday I might write a plugin for Omeka that will let an item, say a historical source with an attached PDF, also have a link to a GitHub file. That link would be exposed through the Omeka API, so that someone could get the raw file from GitHub through the Omeka site. Thus, GitHub could function as a back end for Omeka. But that's just an idea; for now it's more important to me to start gathering the data in useful ways.)
3. Include citation and explanatory data in separate files. It's possible to embed comments in CSV files, but there doesn't seem to be a standard. And when languages like R read CSVs, you have to specify what the comment character is or you get gibberish. It seems better to have a file like lastname-1865.csv have a corresponding file named lastname-1865.txt, which would contain the citation and any necessary explanation of the fields in the CSV.
I also like Wayne Graham's suggestion on another thread to use Zotero IDs. I'd have to decide whether GitHub + Omeka + Zotero is too many moving parts. And in any case, I'd want users to get everything they need from a
4. Whenever possible, try to make the explanations join-able. It's a lot of work when CSVs have fields named AHL0014 and you have to rename them without a simple join or merge.
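To make principle 4 concrete, here is a rough sketch of what a join-able explanation might look like: a small codebook table with one row per field code, which can then be used to rename the columns of the data. Everything here is an assumption for illustration. The codebook layout (columns named code and label), the field AHL0015, the labels, and the numbers are all made up; only AHL0014 comes from the example above. It is written in Python with the standard csv module, but the same idea would work as a merge in R.

```python
import csv
import io

# Hypothetical transcribed data. AHL0014 is the field name mentioned above;
# AHL0015 and all of the values are invented for this sketch.
DATA_CSV = """year,AHL0014,AHL0015
1865,120,45
1866,131,52
"""

# Hypothetical codebook: one row per cryptic field code, so it can be joined
# against the column names instead of being read by a human and retyped.
CODEBOOK_CSV = """code,label
AHL0014,members
AHL0015,ministers
"""


def load_codebook(text):
    """Return a dict mapping field codes to readable labels."""
    return {row["code"]: row["label"] for row in csv.DictReader(io.StringIO(text))}


def rename_fields(data_text, codebook):
    """Read the data and rename every column that has a codebook entry.

    Columns without an entry (such as 'year') keep their original name.
    """
    reader = csv.DictReader(io.StringIO(data_text))
    return [
        {codebook.get(name, name): value for name, value in row.items()}
        for row in reader
    ]


rows = rename_fields(DATA_CSV, load_codebook(CODEBOOK_CSV))
print(rows[0])
```

In a real repository the two strings would be files sitting next to each other, and the codebook could carry extra columns (units, notes, a citation key) without breaking the join.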
Do these make sense? Are there better ways or other considerations?