Services or capabilities described in AWS documentation might vary by Region. To see the differences applicable to the AWS European Sovereign Cloud Region, see the AWS European Sovereign Cloud User Guide.
Data processing using the dataprocessing command
You use the Neptune ML dataprocessing command to create a data processing job,
check its status, stop it, or list all active data-processing jobs.
Creating a data-processing job using the Neptune ML dataprocessing command
A typical Neptune ML dataprocessing command for creating a new job
looks like this:
- AWS CLI
-
aws neptunedata start-ml-data-processing-job \
--endpoint-url https://your-neptune-endpoint:port \
--input-data-s3-location "s3://(S3 bucket name)/(path to your input folder)" \
--id "(a job ID for the new job)" \
--processed-data-s3-location "s3://(S3 bucket name)/(path to your output folder)"
For more information, see start-ml-data-processing-job in the AWS CLI Command Reference.
- SDK
-
import boto3
from botocore.config import Config
client = boto3.client(
'neptunedata',
endpoint_url='https://your-neptune-endpoint:port',
config=Config(read_timeout=None, retries={'total_max_attempts': 1})
)
response = client.start_ml_data_processing_job(
inputDataS3Location='s3://(S3 bucket name)/(path to your input folder)',
id='(a job ID for the new job)',
processedDataS3Location='s3://(S3 bucket name)/(path to your output folder)'
)
print(response)
- awscurl
-
awscurl https://your-neptune-endpoint:port/ml/dataprocessing \
--region us-east-1 \
--service neptune-db \
-X POST \
-H 'Content-Type: application/json' \
-d '{
"inputDataS3Location" : "s3://(S3 bucket name)/(path to your input folder)",
"id" : "(a job ID for the new job)",
"processedDataS3Location" : "s3://(S3 bucket name)/(path to your output folder)"
}'
This example assumes that your AWS credentials are configured in your
environment. Replace us-east-1 with the Region of your
Neptune cluster.
- curl
-
curl \
-X POST https://your-neptune-endpoint:port/ml/dataprocessing \
-H 'Content-Type: application/json' \
-d '{
"inputDataS3Location" : "s3://(S3 bucket name)/(path to your input folder)",
"id" : "(a job ID for the new job)",
"processedDataS3Location" : "s3://(S3 bucket name)/(path to your output folder)"
}'
A command to initiate incremental reprocessing looks like this:
- AWS CLI
-
aws neptunedata start-ml-data-processing-job \
--endpoint-url https://your-neptune-endpoint:port \
--input-data-s3-location "s3://(S3 bucket name)/(path to your input folder)" \
--id "(a job ID for this job)" \
--processed-data-s3-location "s3://(S3 bucket name)/(path to your output folder)" \
--previous-data-processing-job-id "(the job ID of a previously completed job to update)"
For more information, see start-ml-data-processing-job in the AWS CLI Command Reference.
- SDK
-
import boto3
from botocore.config import Config
client = boto3.client(
'neptunedata',
endpoint_url='https://your-neptune-endpoint:port',
config=Config(read_timeout=None, retries={'total_max_attempts': 1})
)
response = client.start_ml_data_processing_job(
inputDataS3Location='s3://(S3 bucket name)/(path to your input folder)',
id='(a job ID for this job)',
processedDataS3Location='s3://(S3 bucket name)/(path to your output folder)',
previousDataProcessingJobId='(the job ID of a previously completed job to update)'
)
print(response)
- awscurl
-
awscurl https://your-neptune-endpoint:port/ml/dataprocessing \
--region us-east-1 \
--service neptune-db \
-X POST \
-H 'Content-Type: application/json' \
-d '{
"inputDataS3Location" : "s3://(S3 bucket name)/(path to your input folder)",
"id" : "(a job ID for this job)",
"processedDataS3Location" : "s3://(S3 bucket name)/(path to your output folder)",
"previousDataProcessingJobId" : "(the job ID of a previously completed job to update)"
}'
This example assumes that your AWS credentials are configured in your
environment. Replace us-east-1 with the Region of your
Neptune cluster.
- curl
-
curl \
-X POST https://your-neptune-endpoint:port/ml/dataprocessing \
-H 'Content-Type: application/json' \
-d '{
"inputDataS3Location" : "s3://(S3 bucket name)/(path to your input folder)",
"id" : "(a job ID for this job)",
"processedDataS3Location" : "s3://(S3 bucket name)/(path to your output folder)",
"previousDataProcessingJobId" : "(the job ID of a previously completed job to update)"
}'
Parameters for dataprocessing job creation
-
id –
(Optional) A unique identifier for the new job.
Type: string. Default: An autogenerated UUID.
-
previousDataProcessingJobId –
(Optional) The job ID of a completed data processing job run on an earlier
version of the data.
Type: string. Default: none.
Note: Use this for incremental data processing, to update the
model when graph data has changed (but not when data has been deleted).
-
inputDataS3Location –
(Required) The URI of the Amazon S3 location where you want SageMaker AI
to download the data needed to run the data processing job.
Type: string.
-
processedDataS3Location –
(Required) The URI of the Amazon S3 location where you want SageMaker AI
to save the results of a data processing job.
Type: string.
-
sagemakerIamRoleArn –
(Optional) The ARN of an IAM role for SageMaker AI execution.
Type: string. Note: This must be
listed in your DB cluster parameter group or an error will occur.
-
neptuneIamRoleArn –
(Optional) The Amazon Resource Name (ARN) of an IAM role that SageMaker AI
can assume to perform tasks on your behalf.
Type: string. Note: This must be
listed in your DB cluster parameter group or an error will occur.
-
processingInstanceType –
(Optional) The type of ML instance used during data processing.
Its memory should be large enough to hold the processed dataset.
Type: string. Default: the smallest
ml.r5 type whose memory is ten times larger than the size of the exported
graph data on disk.
Note: Neptune ML can select the instance type automatically.
See Selecting an instance for data processing.
-
processingInstanceVolumeSizeInGB –
(Optional) The disk volume size of the processing instance.
Both input data and processed data are stored on disk, so the volume size must
be large enough to hold both data sets.
Type: integer. Default: 0.
Note: If not specified or 0, Neptune ML chooses the
volume size automatically based on the data size.
-
processingTimeOutInSeconds –
(Optional) Timeout in seconds for the data processing job.
Type: integer. Default: 86,400 (1 day).
-
modelType –
(Optional) One of the two model types that Neptune ML currently supports:
heterogeneous graph models (heterogeneous) and knowledge graph models (kge).
Type: string. Default: none.
Note: If not specified, Neptune ML chooses the
model type automatically based on the data.
-
configFileName –
(Optional) A data specification file that describes how to load
the exported graph data for training. The file is automatically generated by the
Neptune export toolkit.
Type: string. Default: training-data-configuration.json.
-
subnets –
(Optional) The IDs of the subnets in the Neptune VPC.
Type: list of strings. Default: none.
-
securityGroupIds –
(Optional) The VPC security group IDs.
Type: list of strings. Default: none.
-
volumeEncryptionKMSKey –
(Optional) The AWS Key Management Service (AWS KMS) key that SageMaker AI uses to
encrypt data on the storage volume attached to the ML compute instances
that run the processing job.
Type: string. Default: none.
-
enableInterContainerTrafficEncryption –
(Optional) Enable or disable inter-container traffic encryption in training or
hyperparameter tuning jobs.
Type: boolean. Default: True.
-
s3OutputEncryptionKMSKey –
(Optional) The AWS Key Management Service (AWS KMS) key that SageMaker AI uses to
encrypt the output of the training job.
Type: string. Default: none.
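Because all of the parameters above except the two S3 locations are optional, a job-creation call should send only the options you actually set and let Neptune ML fill in the defaults. The sketch below is one way to do that with the SDK; the endpoint, bucket paths, and chosen option values are placeholders, not recommendations.

```python
def start_processing_job(client, input_s3, output_s3, job_id=None,
                         instance_type=None, model_type=None):
    """Start a Neptune ML data-processing job, passing only the optional
    parameters that were actually supplied so that defaults apply for
    everything omitted."""
    params = {
        'inputDataS3Location': input_s3,
        'processedDataS3Location': output_s3,
    }
    if job_id is not None:
        params['id'] = job_id
    if instance_type is not None:
        params['processingInstanceType'] = instance_type
    if model_type is not None:
        params['modelType'] = model_type
    return client.start_ml_data_processing_job(**params)

# Example usage (endpoint and S3 paths are placeholders):
#
# import boto3
# from botocore.config import Config
# client = boto3.client(
#     'neptunedata',
#     endpoint_url='https://your-neptune-endpoint:port',
#     config=Config(read_timeout=None, retries={'total_max_attempts': 1})
# )
# start_processing_job(client, 's3://bucket/input/', 's3://bucket/output/',
#                      instance_type='ml.r5.xlarge', model_type='heterogeneous')
```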
Getting the status of a data-processing job using the Neptune ML dataprocessing command
A sample Neptune ML dataprocessing command for checking the status of a job looks like this:
- AWS CLI
-
aws neptunedata get-ml-data-processing-job \
--endpoint-url https://your-neptune-endpoint:port \
--id "(the job ID)"
For more information, see get-ml-data-processing-job in the AWS CLI Command Reference.
- SDK
-
import boto3
from botocore.config import Config
client = boto3.client(
'neptunedata',
endpoint_url='https://your-neptune-endpoint:port',
config=Config(read_timeout=None, retries={'total_max_attempts': 1})
)
response = client.get_ml_data_processing_job(
id='(the job ID)'
)
print(response)
- awscurl
-
awscurl https://your-neptune-endpoint:port/ml/dataprocessing/(the job ID) \
--region us-east-1 \
--service neptune-db \
-X GET
This example assumes that your AWS credentials are configured in your
environment. Replace us-east-1 with the Region of your
Neptune cluster.
- curl
-
curl -s \
"https://your-neptune-endpoint:port/ml/dataprocessing/(the job ID)" \
| python -m json.tool
Parameters for dataprocessing job status
-
id –
(Required) The unique identifier of the data-processing job.
Type: string.
-
neptuneIamRoleArn –
(Optional) The ARN of an IAM role that provides Neptune access to
SageMaker AI and Amazon S3 resources.
Type: string. Note: This must be
listed in your DB cluster parameter group or an error will occur.
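Because data processing can run for a long time, a common pattern is to poll the status call until the job finishes before moving on to model training. A minimal sketch with the SDK follows; the status values checked here ('InProgress', 'Pending') are assumptions about the API's status vocabulary, so adjust them to match what your cluster actually returns.

```python
import time

def wait_for_processing_job(client, job_id, poll_seconds=60):
    """Poll get_ml_data_processing_job until the job leaves a running
    state, then return the final status string. The running-state names
    below are assumptions, not an exhaustive list from the API."""
    while True:
        status = client.get_ml_data_processing_job(id=job_id)['status']
        if status not in ('InProgress', 'Pending'):
            return status
        time.sleep(poll_seconds)
```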
Stopping a data-processing job using the Neptune ML dataprocessing command
A sample Neptune ML dataprocessing command for stopping a job looks like this:
- AWS CLI
-
aws neptunedata cancel-ml-data-processing-job \
--endpoint-url https://your-neptune-endpoint:port \
--id "(the job ID)"
To also clean up Amazon S3 artifacts:
aws neptunedata cancel-ml-data-processing-job \
--endpoint-url https://your-neptune-endpoint:port \
--id "(the job ID)" \
--clean
For more information, see cancel-ml-data-processing-job in the AWS CLI Command Reference.
- SDK
-
import boto3
from botocore.config import Config
client = boto3.client(
'neptunedata',
endpoint_url='https://your-neptune-endpoint:port',
config=Config(read_timeout=None, retries={'total_max_attempts': 1})
)
response = client.cancel_ml_data_processing_job(
id='(the job ID)',
clean=True
)
print(response)
- awscurl
-
awscurl https://your-neptune-endpoint:port/ml/dataprocessing/(the job ID) \
--region us-east-1 \
--service neptune-db \
-X DELETE
To also clean up Amazon S3 artifacts:
awscurl "https://your-neptune-endpoint:port/ml/dataprocessing/(the job ID)?clean=true" \
--region us-east-1 \
--service neptune-db \
-X DELETE
This example assumes that your AWS credentials are configured in your
environment. Replace us-east-1 with the Region of your
Neptune cluster.
- curl
-
curl -s \
-X DELETE "https://your-neptune-endpoint:port/ml/dataprocessing/(the job ID)"
Or, to also delete the Amazon S3 artifacts when stopping the job:
curl -s \
-X DELETE "https://your-neptune-endpoint:port/ml/dataprocessing/(the job ID)?clean=true"
Parameters for dataprocessing stop job
-
id –
(Required) The unique identifier of the data-processing job.
Type: string.
-
neptuneIamRoleArn –
(Optional) The ARN of an IAM role that provides Neptune access to
SageMaker AI and Amazon S3 resources.
Type: string. Note: This must be
listed in your DB cluster parameter group or an error will occur.
-
clean –
(Optional) This flag specifies that all Amazon S3 artifacts
should be deleted when the job is stopped.
Type: boolean. Default: false.
Listing active data-processing jobs using the Neptune ML dataprocessing command
A sample Neptune ML dataprocessing command for listing active jobs looks like this:
- AWS CLI
-
aws neptunedata list-ml-data-processing-jobs \
--endpoint-url https://your-neptune-endpoint:port
To limit the number of results:
aws neptunedata list-ml-data-processing-jobs \
--endpoint-url https://your-neptune-endpoint:port \
--max-items 3
For more information, see list-ml-data-processing-jobs in the AWS CLI Command Reference.
- SDK
-
import boto3
from botocore.config import Config
client = boto3.client(
'neptunedata',
endpoint_url='https://your-neptune-endpoint:port',
config=Config(read_timeout=None, retries={'total_max_attempts': 1})
)
response = client.list_ml_data_processing_jobs(
maxItems=3
)
print(response)
- awscurl
-
awscurl https://your-neptune-endpoint:port/ml/dataprocessing \
--region us-east-1 \
--service neptune-db \
-X GET
To limit the number of results:
awscurl "https://your-neptune-endpoint:port/ml/dataprocessing?maxItems=3" \
--region us-east-1 \
--service neptune-db \
-X GET
This example assumes that your AWS credentials are configured in your
environment. Replace us-east-1 with the Region of your
Neptune cluster.
- curl
-
curl -s "https://your-neptune-endpoint:port/ml/dataprocessing"
Or, to limit the number of results:
curl -s "https://your-neptune-endpoint:port/ml/dataprocessing?maxItems=3"
Parameters for dataprocessing list jobs
-
maxItems –
(Optional) The maximum number of items to return.
Type: integer. Default: 10.
Maximum allowed value: 1024.
-
neptuneIamRoleArn –
(Optional) The ARN of an IAM role that provides Neptune access to
SageMaker AI and Amazon S3 resources.
Type: string. Note: This must be
listed in your DB cluster parameter group or an error will occur.
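The list and status calls can be combined to get a quick overview of every active job. The sketch below assumes the list response carries its job IDs under an 'ids' key and each status response carries a 'status' key; verify both against the actual response shapes for your engine version.

```python
def summarize_processing_jobs(client, max_items=10):
    """List active data-processing job IDs, then look up each job's
    status, returning a {job_id: status} mapping. The 'ids' and
    'status' response keys are assumptions about the response shape."""
    listing = client.list_ml_data_processing_jobs(maxItems=max_items)
    return {job_id: client.get_ml_data_processing_job(id=job_id).get('status')
            for job_id in listing.get('ids', [])}
```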