There are two areas of interest in loading Activity data into Mint:
CSV Upload fields (proposed)

Sept 6, 2012: These fields are to be finalised before implementation is complete. The information in this section is not yet implemented.
The documentation below this section describes a CSV with the following fields: ID, Submit Year, Start Year, Title, Description, Institution, Investigators, Discipline.
It is proposed to alter the field set above to better capture local activities via a local activity harvest config (rather than an ARC/NHMRC harvest); the revised field set is still to be finalised.
Loading Public Activity Data

A default installation ships with sample Activity data, but before going to production you will want to replace this with Activity records more relevant to your institution. The sample data comes from the NCGP spreadsheets and NHMRC spreadsheets, which are manually massaged in Excel to prepare the datasets for machine ingest. The complete, ready-to-ingest CSVs (around 37,000 Activity records covering all institutions) are available in their own repository. These files would normally be filtered down to your local institution to create a production dataset. It is possible to ingest the full dataset if desired, but it will make your system considerably larger than a typical install: the storage layer holds many more items, which lengthens backup/restore times, although the disk space impact is minimal. The sample data shipped by default is filtered to the University of Newcastle (~900 Activities).

All NCGP and NHMRC Activities are already available in Research Data Australia. ANDS pre-loads these Activity records, so Mint will never need to curate them or expose them via the OAI-PMH harvester.

Integrating Local Activity Data

If you would like to generate RIF-CS Activity records for other projects/grants at your institution, it is fairly simple to bring a new data source online. Flinders University first tried this on early versions of Mint (v1.2/v1.3) and the documentation below is a distillation of their efforts, drawn from the mailing list: Thread 1, Thread 2. The details below are loosely in the order you need to do things, along with some points of consideration. They have been extracted from the above threads to make life easier.

1) Data Format

For the sake of simplicity we are going to mimic the data format of the existing public Activities. Customisations are very possible if you want to send additional data to RDA, but the following columns are recommended as a baseline, since they will let us reuse existing code with minimal effort: ID, Submit Year, Start Year, Title, Description, Institution, Investigators, Discipline.
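If you are preparing your file by cutting down the full public dataset, a short script can do the filtering for you. A minimal sketch in Python, assuming the baseline columns above; the source filename and institution value are examples only:

```
import csv

# Sketch: filter a full Activities CSV down to one institution before
# ingest. Filenames and the institution value are hypothetical.
SOURCE = "all_activities.csv"
TARGET = "Activities_Other_Projects.csv"
INSTITUTION = "University of Newcastle"

with open(SOURCE, newline="") as src, open(TARGET, "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        if row["Institution"] == INSTITUTION:
            writer.writerow(row)
```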
Again, for the sake of simplicity, drop your data in a CSV file alongside the existing Activity data in ${fascinator.home}/data/.

2) Curation Choices

Because the public Activities are already present in RDA they never required curation, and as such sections of their configuration will need to be altered to accommodate this. Before you start it is worth giving some thought to your broad intentions here, particularly if your institution has a non-trivial curation environment. As a rough guide, looking at how you are curating your Service records is probably a good fit, since both Parties and Collections have some potential complexities that are unique to just those types of RIF-CS records.

3) Data Source Configuration

The starting point for adding a new data source is the harvest configuration file, found in ${fascinator.home}/harvest/. Now we are going to edit the config file to make some alterations related to our curation choice, along with other minor tweaks. In this case, we are using the 'Local' curation plugin, the very simple implementation found in the downloadable demonstration install. The altered and new values relate mainly to the curation choice:

```
{
    "harvester": {
        "type": "csv",
        "csv": {
            "fileLocation": "${fascinator.home}/data/Activities_Other_Projects.csv",
            "idColumn": "ID",
            "recordIDPrefix": "redbox-mint.googlecode.com/activities/",
            ...<snip>...
        }
    },
    "transformer": {
        "curation": ["local"],
        "metadata": ["jsonVelocity"]
    },
    "curation": {
        "neverPublish": false,
        "alreadyCurated": false
    },
    "transformerOverrides": {
        "local": {
            "template": "${server.url.base}published/detail/[[OID]]"
        },
        "jsonVelocity": {
            "templatesPath": "${fascinator.home}/templates/activities",
            "portalId": "Activities"
        }
    },
    "indexer": {
        "script": {
            "type": "python",
            "rules": "Activities.py"
        },
        "params": {
            "repository.name": "Local Research Projects",
            "repository.type": "Research Activities"
        }
    }
}
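```

To make the harvester settings concrete: the value in the 'idColumn' column of each CSV row is appended to 'recordIDPrefix' to form the record's identifier. A rough Python sketch of that behaviour (it mimics the CSV harvester; it is not the actual plugin code):

```
import csv

# Mimics (roughly) how the CSV harvester builds record IDs; values are
# taken from the config above. Not the actual plugin code.
RECORD_ID_PREFIX = "redbox-mint.googlecode.com/activities/"
ID_COLUMN = "ID"

with open("Activities_Other_Projects.csv", newline="") as f:
    for row in csv.DictReader(f):
        print(RECORD_ID_PREFIX + row[ID_COLUMN])
```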
There is one point worth noting in there as well that was a shortcut for us; the details are covered in the mailing list threads linked above.

4) Indexing Logic

Another line in the configuration file we didn't change points at a Jython rules file in the same directory ('Activities.py'). Since this was first done on the mailing list, a couple of changes in curation have removed one of the suggested alterations to this file, so all that remains is to add a few lines to the end of the '__basicData()' method:

```
def __basicData(self):
    self.utils.add(self.index, "repository_name", self.params["repository.name"])
    self.utils.add(self.index, "repository_type", self.params["repository.type"])
    # Persistent Identifiers
    pidProperty = self.config.getString(None, ["curation", "pidProperty"])
    if pidProperty is None:
        self.log.error("No configuration found for persistent IDs!")
    else:
        pid = self.params[pidProperty]
        if pid is not None:
            self.utils.add(self.index, "known_ids", pid)
            self.utils.add(self.index, "pidProperty", pid)
            self.utils.add(self.index, "oai_identifier", pid)
    self.utils.add(self.index, "oai_set", "Activities")
    # Publication
    published = self.params["published"]
    if published is not None:
        self.utils.add(self.index, "published", "true")
```

5) RIF-CS Template

At this point you should be able to run the harvest (see below) and see the records in your system. There are just a couple of other configuration points that are relevant to your Activity records. The first is how RIF-CS entries are generated to be sent to RDA. Take a look in '${fascinator.home}/templates/activities'.

By default, you shouldn't really need to do anything to this file since it is already up-to-date. It can handle curation and finding the configured persistent identifier according to your system-wide config. If however you want to enrich this template to send more detailed Activity records to RDA, this is the place to do it. You may be interested in adding a block like this (which I stole from Parties Groups again) to add any related objects that the curation process is aware of:

```
### Relations
#set($valueList = $item.getJsonSimpleList("relationships"))
#if($valueList)
    #foreach($value in $valueList)
        #if ($value.getBoolean(false, "isCurated"))
            #set($key = $value.getString("empty", "curatedPid"))
            #set($rel = $value.getString("hasAssociationWith", "relationship"))
            #if ($key != "empty")
    <relatedObject>
        <key>$util.encodeXml($key)</key>
        <relation type="$util.encodeXml($rel)"/>
    </relatedObject>
            #end
        #end
    #end
#end
```
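To see what that loop is doing, here is the same filtering logic sketched in Python over a hypothetical 'relationships' structure (the field names mirror those the template reads; the PID value is made up):

```
# Hypothetical 'relationships' data, shaped like what the template reads.
relationships = [
    {"isCurated": True, "curatedPid": "hdl:1959.13/12345",  # made-up PID
     "relationship": "isOutputOf"},
    {"isCurated": False},  # not yet curated: skipped, as in the template
]

for value in relationships:
    if value.get("isCurated", False):
        # In Velocity, getString's first argument is the default value
        key = value.get("curatedPid", "empty")
        rel = value.get("relationship", "hasAssociationWith")
        if key != "empty":
            print('<relatedObject><key>%s</key><relation type="%s"/>'
                  '</relatedObject>' % (key, rel))
```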
If you have added any additional columns to your originally ingested data you are going to need to add them to this template as well to make the most use of them.

6) Portal Visibility

This should already be taken care of, but if you are trying to modify your setup to do something different there are only two areas you need to be aware of:
Harvesting Your Activities

Assuming the values from the above example were used, your new data source should show up alongside all of the other data sources that the command line harvest tool can see. The examples below come from my Windows laptop, but they work the same under *nix; just substitute an appropriate path and 'tf_harvest.sh':

```
c:\code\tf2\mint-builds\dev-local\server>tf_harvest.bat
Usage: tf_harvest.bat jsonFile
Where jsonFile is a JSON configuration file

If jsonFile is not an absolute path, the file is assumed to be in:
 c:/code/tf2/mint-builds/dev-local/home\harvest

Available files:
 Activities_NGCP_Completed
 Activities_NGCP_New_Ongoing
 Activities_NHMRC_2010
 Activities_Other_Projects
 ANZSRC_FOR
 ANZSRC_SEO
 Funding_Bodies
 Languages
 Parties_Groups
 Parties_People
 RDA_Parties
 Services

c:\code\tf2\mint-builds\dev-local\server>tf_harvest.bat Activities_Other_Projects
-XX:MaxPermSize=256m -Xmx512m -Dsolr.solr.home="c:/code/tf2/mint-builds/dev-local/solr" -Dgeonames.solr.home="c:/code/tf2/mint-builds/dev-local/home/geonames/solr" -Dhttp.proxyHost= -Dhttp.proxyPort= -Dhttp.nonProxyHosts="localhost" -Djetty.port=9001 -Djetty.logs=c:/code/tf2/mint-builds/dev-local/home\logs\jetty -Djetty.home=c:/code/tf2/mint-builds/dev-local/server/jetty -Dfascinator.home="c:/code/tf2/mint-builds/dev-local/home" -Dportal.home="c:/code/tf2/mint-builds/dev-local/portal" -Dstorage.home="c:/code/tf2/mint-builds/dev-local/storage" -Dserver.url.base="http://138.77.204.185:9001/mint/" -Damq.port=9201 -Damq.stomp.port=9202 -Dsmtp.host="localhost" -Dadmin.email="[email protected]" -Dredbox.version="1.5"

c:\code\tf2\mint-builds\dev-local\server>
```

If you want to follow the flow of the harvest process, the starting point would be the harvest client's log file. Towards the end of the process you should see the Solr wrapper indexing your records:

```
09:56:45,594 solrwrapper DEBUG WrapperQueueConsumer === New buffer starting: 1338249405594
09:57:07,897 solrwrapper DEBUG WrapperQueueConsumer === Flushing old buffer: 22s
09:57:07,897 solrwrapper DEBUG WrapperQueueConsumer === Submitting buffer: 16 documents
09:57:09,600 solrwrapper INFO WrapperQueueConsumer Running forced commit!
```
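As a quick sanity check that everything reached the index, you can tally the 'Submitting buffer' lines from that log. A small sketch; the log path is an assumption, so point it at wherever your install writes the Solr wrapper's log:

```
import re

# Tally documents submitted to Solr during a harvest. The log path is
# hypothetical; adjust it to your install's log directory.
total = 0
with open("logs/solr_wrapper.log") as log:
    for line in log:
        match = re.search(r"Submitting buffer: (\d+) documents", line)
        if match:
            total += int(match.group(1))
print("Documents submitted to Solr:", total)
```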