Workflow Data Schema
====================

The Workflow schema captures high-level metadata and structure for a Flowcept-enabled run, including user-defined information, system context, and artifact lineage.

Workflow Fields
----------------

- **workflow_id** (str): Unique identifier for this workflow execution.
- **used** (dict): Data or resources used by the workflow. Typically includes model and dataset file paths and script-level global configuration.
- **generated** (dict): Outputs generated by the workflow (e.g., model paths, key summarized results, artifacts).
- **parent_workflow_id** (str): Identifier of the parent workflow, if applicable (e.g., for nested workflows or pipelines).
- **machine_info** (dict): Information about the machine or compute resource where the workflow was executed.
- **conf** (dict): Configuration for the run; typically contains the path to the Flowcept settings file in use.
- **flowcept_settings** (dict): Snapshot of the resolved Flowcept-specific configuration from the ``settings.yaml`` file.
- **flowcept_version** (str): Version of Flowcept used during execution.
- **utc_timestamp** (float): Timestamp (UTC, seconds since epoch) indicating when this workflow metadata was recorded.
- **user** (str): Username of the person or agent who initiated the workflow. Derived from ``sys_metadata.login_name`` if available; otherwise falls back to ``getpass.getuser()``, then ``os.getlogin()``, or remains ``None``.
- **campaign_id** (str): Optional campaign identifier grouping related workflows.
- **adapter_id** (str): The adapter or source component that launched or instrumented this workflow.
- **interceptor_ids** (list of str): Flowcept interceptor instance identifiers used during instrumentation.
- **name** (str): Human-readable name for the workflow (e.g., "training-run-001").
- **subtype** (str): Optional workflow subtype or category (e.g., ``ml_workflow``).
- **custom_metadata** (dict): User-defined metadata for extended tagging or traceability.
- **environment_id** (str): Identifier for the execution environment (e.g., a cluster name such as Frontier or Summit).
- **sys_name** (str): Name of the operating system. Derived from ``os.uname()[0]``.
- **node_name** (str): Hostname of the compute node. Derived from ``os.uname()[1]``.
- **hostname** (str): Fully qualified domain name of the host, resolved via ``socket.getfqdn()`` with multiple fallbacks.
- **public_ip** (str): Public IP address of the machine. Derived from ``sys_metadata.public_ip``, if available.
- **private_ip** (str): Private (intranet) IP address of the machine. Derived from ``sys_metadata.private_ip``, if available.
- **extra_metadata** (str): Serialized extra metadata captured from external config sources.

Notes
-----

- The schema prioritizes system-derived metadata from a ``sys_metadata`` block inside the Flowcept configuration.
- System identification relies on environment variables, standard-library calls, and fallback mechanisms so that it remains portable across platforms.
- The ``used`` and ``generated`` fields support artifact lineage and can store references to any structured or semi-structured resource.
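
The fallback order described for ``user`` and the system-derived fields can be sketched as follows. This is a minimal illustration, not Flowcept's implementation: the helper names ``resolve_user`` and ``build_workflow_record`` are hypothetical, the use of a UUID for ``workflow_id`` is an assumption, and ``sys_metadata`` is modeled as a plain dict. Only a subset of the fields is populated, and ``os.uname()`` is available on POSIX systems only.

```python
import getpass
import os
import socket
import time
import uuid


def resolve_user(sys_metadata):
    """Hypothetical helper: pick a username using the documented fallback order."""
    login_name = sys_metadata.get("login_name")
    if login_name:
        return login_name
    # Fall back to the standard library; either call may fail in
    # containers or batch jobs without a controlling terminal.
    for getter in (getpass.getuser, os.getlogin):
        try:
            return getter()
        except Exception:
            continue
    return None


def build_workflow_record(name, used, sys_metadata=None):
    """Hypothetical helper: assemble a dict with a subset of the schema fields."""
    sys_metadata = sys_metadata or {}
    uname = os.uname()  # POSIX only
    return {
        "workflow_id": str(uuid.uuid4()),  # assumption: any unique string works
        "name": name,
        "used": used,
        "generated": {},
        "utc_timestamp": time.time(),
        "user": resolve_user(sys_metadata),
        "sys_name": uname[0],
        "node_name": uname[1],
        "hostname": socket.getfqdn(),
        "public_ip": sys_metadata.get("public_ip"),
        "private_ip": sys_metadata.get("private_ip"),
    }


record = build_workflow_record(
    "training-run-001",
    used={"dataset": "data/train.csv"},
    sys_metadata={"login_name": "alice", "private_ip": "10.0.0.5"},
)
# Outputs are recorded in ``generated`` once the run has produced them.
record["generated"]["model_path"] = "models/run-001.pt"
```

Because ``sys_metadata`` takes priority, the record above reports ``alice`` as the user regardless of the account running the script, while ``public_ip`` stays ``None`` because no value was supplied.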