Azure Data Factory - Copy data from Azure Blob Storage to Azure Cosmos DB

In my previous posts, we saw how to copy data from Azure Blob Storage to Azure Cosmos DB using the Azure Data Factory copy wizard. In this post, let us see how to perform the same copy operation by creating JSON definitions for the Linked services, Datasets, Pipeline & Activity from the Azure portal.

First, we need to create an Azure data factory from the Azure portal:

Click New -> Data + Analytics -> Data Factory

After creating the Azure data factory, open it and click Author and deploy to create the JSON definitions for the Linked services, Datasets, Pipeline & Activity from the Azure portal.

I created an Azure Blob storage account and an Azure Cosmos DB SQL API account in my previous posts; they are the source and destination for this Azure Data Factory copy activity example.


Step 1: Create & deploy Linked services

The key for the Azure Blob storage account can be easily obtained from Storage Explorer (right-click on the storage account -> Copy Primary Key).

Azure data factory -> Author and deploy -> New data store -> Azure storage
Edit the account name & key in the below JSON and deploy:



{
    "name": "AzureStorageLinkedService",
    "properties": {
        "description": "",
        "hubName": "azdatafacv1_hub",
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=azblobstore;AccountKey=**********"
        }
    }
}

To get the Cosmos DB key, go to the Cosmos DB account -> Keys in the Azure portal.

Azure data factory -> Author and deploy -> New data store -> Azure DocumentDB
Edit the accountendpoint, accountkey & database in the below JSON and deploy:



{
    "name": "DocumentDbLinkedService",
    "properties": {
        "hubName": "azdatafacv1_hub",
        "type": "DocumentDb",
        "typeProperties": {
            "connectionString": "accountendpoint=https://azcosmosdbsqlapi.documents.azure.com:443/;accountkey=**********;database=SQLDocDB"
        }
    }
}

Step 2: Create & deploy Datasets

In the below dataset, we are not going to define the structure of the data or any column mapping, as this is an as-is copy of the JSON document (a sketch of what an explicit structure would look like is shown after the dataset JSON).

Azure data factory -> Author and deploy -> ...More -> New dataset -> Azure blob storage
Edit the file name and folder path in the below JSON and deploy:



{
    "name": "AzureBlobDataset",
    "properties": {
        "published": false,
        "type": "AzureBlob",
        "linkedServiceName": "AzureStorageLinkedService",
        "typeProperties": {
            "fileName": "Doc3.Json",
            "folderPath": "azblobcontainer",
            "format": {
                "type": "JsonFormat"
            }
        },
        "availability": {
            "frequency": "Minute",
            "interval": 15
        },
        "external": true
    }
}

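As mentioned above, the structure of the data is not defined since this is an as-is copy. For reference only, if we did want to declare an explicit structure on this dataset, it would look roughly like the below (the field names here are hypothetical, not taken from Doc3.Json), added under "properties" next to "typeProperties":

"structure": [
    { "name": "id", "type": "String" },
    { "name": "firstName", "type": "String" },
    { "name": "age", "type": "Int32" }
]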

Key properties to be noted in the above JSON:

external: Boolean flag to specify whether a dataset is explicitly produced by a data factory pipeline or not. If the input dataset for an activity is not produced by the current pipeline, set this flag to true.

availability: Defines the processing window (for example, hourly or daily) or the slicing model for the dataset production. Each unit of data consumed and produced by an activity run is called a data slice.

An interval of 15 minutes is the least we can set for data slicing.

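Since the blob dataset is marked external, a retry policy for reading the external data can optionally be attached to it as well. A minimal sketch (the values below are illustrative, not from the original deployment) that would sit under "properties" in the dataset JSON:

"policy": {
    "externalData": {
        "retryInterval": "00:01:00",
        "retryTimeout": "00:10:00",
        "maximumRetry": 3
    }
}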

Azure data factory -> Author and deploy -> ...More -> New dataset -> Azure DocumentDB


Edit the Cosmos DB collection name in the below JSON and deploy:


{
    "name": "DocumentDbTable",
    "properties": {
        "published": false,
        "type": "DocumentDbCollection",
        "linkedServiceName": "DocumentDbLinkedService",
        "typeProperties": {
            "collectionName": "JsonDocs"
        },
        "availability": {
            "frequency": "Minute",
            "interval": 15
        },
        "external": false
    }
}

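To make the copy concrete: the contents of Doc3.Json are not shown in this post, but assuming a simple document like the below, it is written as-is into the JsonDocs collection; Cosmos DB adds an id (a GUID, if the document does not already have one) along with its system properties such as _rid, _etag and _ts.

{
    "firstName": "John",
    "lastName": "Doe",
    "city": "Chennai"
}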





Step 3: Create & deploy Pipeline & Activity



Azure data factory -> Author and deploy -> ...More -> New pipeline

Edit the start & end time and the copy activity name in the below JSON and deploy:

{
    "name": "AzureBlobtoCosmos",
    "properties": {
        "description": "Copy JSON file from Azure blob to Azure Cosmos document DB",
        "activities": [
            {
                "type": "Copy",
                "typeProperties": {
                    "source": {
                        "type": "BlobSource"
                    },
                    "sink": {
                        "type": "DocumentDbCollectionSink",
                        "writeBatchSize": 0,
                        "writeBatchTimeout": "00:00:00"
                    }
                },
                "inputs": [
                    {
                        "name": "AzureBlobDataset"
                    }
                ],
                "outputs": [
                    {
                        "name": "DocumentDbTable"
                    }
                ],
                "name": "Activity-Blob-Doc3_Json->JsonDocs"
            }
        ],
        "start": "2017-12-28T11:00:00.00000Z",
        "end": "2017-12-29T11:00:00.00000Z",
        "isPaused": false,
        "hubName": "azdatafacv1_hub"
    }
}

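In the pipeline above, the copy activity picks up its schedule from the output dataset's availability (15-minute slices). Optionally, a scheduler and an execution policy can be added to the activity JSON; a sketch with illustrative values is below (the scheduler, if specified, must match the output dataset's availability):

"scheduler": {
    "frequency": "Minute",
    "interval": 15
},
"policy": {
    "concurrency": 1,
    "executionPriorityOrder": "OldestFirst",
    "retry": 3,
    "timeout": "01:00:00"
}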


A pipeline is active only between its start time and end time. 
It is not executed before the start time or after the end time. 
If the pipeline is paused, it does not get executed irrespective of its start and end time.

Once all the JSON definitions are deployed successfully, go to Azure data factory -> Monitor & Manage (we can change the start and end time, click Apply, and right-click on the pipeline -> Resume to resume it).

Azure data factory -> Diagram
If we double-click the input dataset, we can see the data slicing details.