Ingest Cookbook (QueryGenerator)
Building on the concepts behind the different ways of ingesting data, this notebook builds an example using the QueryGenerator method.
We will use the Cookbook dataset and persist its data to an ApertureDB instance.
Additional scripts used.
create_nested_json.py
: This script merges the first 3 sheets of the source into a single JSON file. The file contains a list of Dish objects; each Dish may have multiple ingredients, and each ingredient has miscellaneous properties.
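To make the nested layout concrete, here is a minimal sketch of what one record might look like. The keys mirror the fields read by the query generator later in this notebook; every value is a made-up placeholder, and the helper `missing_keys` is hypothetical, not part of the script or the SDK.

```python
# Hypothetical record: keys are taken from the fields the query generator
# in this notebook reads; all values are illustrative placeholders.
sample_dish = {
    "dish_id": 1,
    "name": "example dish",
    "contributor": "example contributor",
    "location": "example location",
    "cuisine": "example cuisine",
    "caption": "example caption",
    "recipe_url": "https://example.com/recipe",
    "url": "https://example.com/dish.jpg",
    "ingredients": [
        {"Name": "rice", "category": "grain"},
        {"Name": "lentils", "macronutrient": "protein"},
    ],
}

# Keys the generator accesses with record[...], so they must be present.
REQUIRED_DISH_KEYS = {
    "dish_id", "name", "contributor", "location", "cuisine",
    "caption", "recipe_url", "url", "ingredients",
}


def missing_keys(dish: dict) -> set:
    """Return the required keys absent from a dish record (empty if none)."""
    missing = REQUIRED_DISH_KEYS - set(dish)
    # Each ingredient must at least carry a "Name"; other fields are optional.
    for i, ingredient in enumerate(dish.get("ingredients", [])):
        if "Name" not in ingredient:
            missing.add(f"ingredients[{i}].Name")
    return missing


print(missing_keys(sample_dish))  # prints set()
```

Running a check like this over the generated file before ingestion can surface incomplete records early, instead of failing partway through with a KeyError.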
Connect to the database.
- If you haven't already set up or configured the database, check out our quick start guide
# Install the required packages
%pip install --upgrade --quiet pip
%pip install --upgrade --quiet aperturedb
Needed resources.
# Get the script that generates dishes.json
!wget https://github.com/aperture-data/Cookbook/raw/refs/heads/main/scripts/create_nested_json.py
# Run the script to generate dishes.json
!python create_nested_json.py
Ingest using a Query Generator
When the source data is in a format that does not conform to any of the CSV parsers in the SDK, we can define a custom query generator instead.
This requires some familiarity with the ApertureDB query language.
Let's implement a class to handle the Cookbook example.
A query generator defines a getitem method that returns the query to issue to ApertureDB in order to persist the record at the given index of the source.
from typing import Tuple
from aperturedb.CommonLibrary import create_connector, execute_query
from aperturedb.QueryGenerator import QueryGenerator
from aperturedb.types import *
from aperturedb.Sources import Sources
import json
from tqdm.auto import tqdm
class CookBookQueryGenerator(QueryGenerator):
    def __init__(self, *args, **kwargs):
        super().__init__()
        assert "dishes" in kwargs, "Path to Dishes must be provided"
        with open(kwargs["dishes"]) as ins:
            self.dishes = json.load(ins)
        print(f"Loaded {len(self.dishes)} dishes")

    def __len__(self) -> int:
        return len(self.dishes)

    def getitem(self, idx: int) -> Tuple[Commands, Blobs]:
        record = self.dishes[idx]
        # Add the dish image; _ref 1 lets the commands below refer to it.
        q = [
            {
                "AddImage": {
                    "_ref": 1,
                    "properties": {
                        "contributor": record["contributor"],
                        "name": record["name"],
                        "location": record["location"],
                        "cuisine": record["cuisine"],
                        "caption": record["caption"],
                        "recipe_url": record["recipe_url"],
                        "dish_id": record["dish_id"]
                    }
                }
            }
        ]
        # Add one Ingredient entity per ingredient, each connected to the image.
        for i, ingredient in enumerate(record["ingredients"]):
            q.append({
                "AddEntity": {
                    "_ref": 2 + i,
                    "class": "Ingredient",
                    "connect": {
                        "ref": 1
                    },
                    "properties": {
                        "Name": ingredient["Name"],
                        "other_names": ingredient.get("other_names", ""),
                        "macronutrient": ingredient.get("macronutrient", ""),
                        "micronutrient": ingredient.get("micronutrient", ""),
                        "subgroup": ingredient.get("subgroup", ""),
                        "category": ingredient.get("category", "")
                    }
                }
            })
        # Download the image; the second element of the result is the blob.
        blob = Sources(n_download_retries=3).load_from_http_url(record["url"], validator=lambda x: True)
        return q, [blob[1]]
client = create_connector()
generator = CookBookQueryGenerator(dishes="dishes.json")
for query, blobs in tqdm(generator):
    result, response, output_blobs = execute_query(client, query, blobs)
    # A non-zero result signals a failure: print the details and stop.
    if result != 0:
        print(response, query)
        break
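Because each ingredient is linked to its image through `_ref` and `connect.ref`, a mismatched reference is an easy mistake to make when writing queries by hand. The following is a minimal local pre-check, a sketch only: `dangling_refs` is a hypothetical helper, not part of the aperturedb SDK, and it assumes each command is a single-key dictionary shaped like the ones built above. The database performs its own validation regardless.

```python
from typing import Any, Dict, List


def dangling_refs(query: List[Dict[str, Any]]) -> List[int]:
    """Return every connect.ref that no earlier command declared via _ref."""
    declared = set()
    dangling = []
    for command in query:
        # Each command is a single-key dict, e.g. {"AddImage": {...}}.
        for body in command.values():
            ref = body.get("connect", {}).get("ref")
            if ref is not None and ref not in declared:
                dangling.append(ref)
            if "_ref" in body:
                declared.add(body["_ref"])
    return dangling


# A query with the same shape as the one produced by getitem above.
query = [
    {"AddImage": {"_ref": 1, "properties": {"dish_id": 1}}},
    {"AddEntity": {"_ref": 2, "class": "Ingredient",
                   "connect": {"ref": 1},
                   "properties": {"Name": "rice"}}},
]
print(dangling_refs(query))  # prints []
```

Such a check could be run on the output of `generator.getitem(idx)` for a few indices before sending anything to the database.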
Custom query generators are also supported by the adb CLI.
The CookBookQueryGenerator class can be saved in a Python file called CookBookQueryGenerator.py, and the ingestion can be invoked with the adb CLI as follows:
adb ingest from-generator CookBookQueryGenerator.py
Caution!
The filename needs to match the class name for the adb command to work.
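To catch a filename mismatch before invoking the CLI, a small check along these lines may help. It is a sketch under assumptions: `defines_matching_class` is a hypothetical helper, not part of adb or the SDK, and it only verifies that the file defines a top-level class named after the file, not that the class is a valid query generator.

```python
import ast
from pathlib import Path


def defines_matching_class(path: str) -> bool:
    """True if the file defines a top-level class named after the file stem."""
    source_path = Path(path)
    # Parse without importing, so the file's dependencies are not required.
    tree = ast.parse(source_path.read_text())
    return any(
        isinstance(node, ast.ClassDef) and node.name == source_path.stem
        for node in tree.body
    )


# Example (assumes the file exists in the working directory):
# defines_matching_class("CookBookQueryGenerator.py")
```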