How To Load Data With Nothing But Flint, Rocks, and Some Kindling

Tamr has some great solutions for getting data loaded. We have an official product solution, the Data Movement Service (DMS), which can be stood up alongside the classic Tamr package. We have df-connect, a Java microservice authored by Tamr’s DataOps engineering team, that plays nicely with Tamr’s native microservices and connects to a wide array of data sources. And we have the Tamr Python Client which, similar to df-connect, integrates seamlessly with both Tamr and a variety of sources.

(And if your data are small, in CSV format, and live on the same machine from which you browse the Tamr front end, the Tamr UI provides an excellent data-import solution, including a preview. For more info, see the public docs.)

You may, however, come across a scenario where you want to load data but also want to minimize the amount of non-core-Tamr software involved. Perhaps you want to use only Tamr’s core APIs alongside the “in-house software you already know and love.”

If so, here is how to go about it purely with JSON data and Tamr APIs. The Tamr docs show how to do each of the following steps via curl, but you could substitute the language of your choice.
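Before hitting any endpoints, it helps to stash the connection details in a couple of environment variables. This is a minimal sketch: the host, port, and credentials below are placeholders, and note that Tamr uses its own BasicCreds scheme in the Authorization header rather than standard HTTP Basic auth.

```bash
# Placeholder host/port and credentials -- substitute your own.
export TAMR_HOST='http://localhost:9100'

# Tamr expects "BasicCreds <base64 of user:password>" in the Authorization
# header, rather than the standard "Basic" scheme.
export TAMR_AUTH="BasicCreds $(echo -n 'username:password' | base64)"
```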

There are three endpoints you will have to hit (a hedged curl sketch covering all three follows the list):

  1. Create the dataset in Tamr via the endpoint described here. You will need to tell Tamr at least (a) the name of the dataset and (b) the primary key (plus some optional parameters).

  2. Create the schema (all of the non-key attributes) via the endpoint described here. You will have to hit this endpoint once per attribute per dataset. You'll also want to keep the type info as in the example shown on the docs page (addAttribute.json), simply changing the attribute name for each POST.

  3. Finally, add your data as newline-delimited JSON via the upsert endpoint described here.
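Putting the three calls together, here is a hedged curl sketch. The paths follow Tamr's versioned v1 API, and the dataset name (my_dataset), primary key (id), attribute names, and records are all invented for illustration; check the linked docs for the exact paths and payloads in your Tamr version.

```bash
# (1) Create the dataset: tell Tamr its name and primary key.
curl -X POST "$TAMR_HOST/api/versioned/v1/datasets" \
  -H "Authorization: $TAMR_AUTH" \
  -H 'Content-Type: application/json' \
  -d '{"name": "my_dataset", "keyAttributeNames": ["id"]}'

# (2) Create the schema: one POST per non-key attribute. DATASET_ID is the
# numeric id returned by the create call above (assumed to be 1 here), and
# the type block mirrors addAttribute.json: each attribute is an array of
# strings.
DATASET_ID=1
for attr in first_name last_name; do
  curl -X POST "$TAMR_HOST/api/versioned/v1/datasets/$DATASET_ID/attributes" \
    -H "Authorization: $TAMR_AUTH" \
    -H 'Content-Type: application/json' \
    -d '{"name": "'"$attr"'", "type": {"baseType": "ARRAY", "innerType": {"baseType": "STRING"}}}'
done

# (3) Upsert records as newline-delimited JSON, one CREATE command per record.
curl -X POST "$TAMR_HOST/api/versioned/v1/datasets/$DATASET_ID:updateRecords" \
  -H "Authorization: $TAMR_AUTH" \
  -H 'Content-Type: application/json' \
  --data-binary @- <<'EOF'
{"action": "CREATE", "recordId": "1", "record": {"id": "1", "first_name": ["Ada"], "last_name": ["Lovelace"]}}
{"action": "CREATE", "recordId": "2", "record": {"id": "2", "first_name": ["Grace"], "last_name": ["Hopper"]}}
EOF
```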

While we said “no Python!” for this example, if you are a Python user, (1) and (2) can be done in a single line of code using the Tamr Python Client (docs are here). We highly recommend the client as a way of interacting programmatically with Tamr, as it makes more complex workflows (like the one described above) much more succinct. That said, Python is not required, and there is some overhead involved in getting set up. For the specific docs on creating a new dataset (including its schema), click here. Depending on whether your dataframe is empty or filled with data, that single line of Python could actually replace (3) as well.

Finally, this is cheating a bit, but if you can place your table's header info into a one-line CSV (i.e., all of your column names with separators, in a file named how you want your Tamr dataset named), you can use the UI's "add a new dataset" functionality to initialize the dataset in Tamr. Put another way, this creates an empty dataset with the desired schema, and you can then proceed with the upsert endpoint from (3).
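As a sketch of that trick, the header-only CSV for the hypothetical dataset above would be a single line (the file name, my_dataset.csv, is made up to match the earlier example):

```bash
# One line: just the column names, in a file named for the desired dataset.
printf 'id,first_name,last_name\n' > my_dataset.csv

# Upload my_dataset.csv through the UI's "add a new dataset" flow, then load
# the actual records with the :updateRecords call from step (3).
```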