Wednesday, July 28, 2010

Using an XML Format File with Bulk Import in SQL Server 2008

In a previous blog post (http://wardlawclaims.blogspot.com/2010/07/introduction-to-sql-server-2008-data.html ) I provided an introduction to the BULK INSERT statement which can be used to import the contents of a CSV file into a table in a SQL Server database.  In this post I will demonstrate how a format file can be used when the columns in the source file are ordered differently than those in the target table.  Below is the T-SQL used to create the target "Contacts" table:

CREATE TABLE Contacts
(
  FirstName VARCHAR(30) NOT NULL,
  LastName VARCHAR(30) NOT NULL,
  PhoneNumber VARCHAR(25) NOT NULL
);

Next is a screen shot of the source CSV file.


While the Contacts table has columns named "FirstName" and "LastName", the CSV file has corresponding columns named "Given name" and "Surname".  The CSV file also contains an additional column, "Date of Birth" which needs to be ignored since no corresponding column exists in the Contacts table.  The default behavior of the BULK INSERT command expects that the columns in the source and target appear in the same order.  To avoid this requirement I will use a format file to specify how the columns in the source file will map to the columns in the target table.

Below is the format file's contents, note that I am using an XML format file.


<xml version="1.0" >
<BCPFORMAT 
  xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <RECORD>
  <FIELD ID="LastName" xsi:type="CharTerm" TERMINATOR="," />
  <FIELD ID="FirstName" xsi:type="CharTerm" TERMINATOR="," />
  <FIELD ID="BirthDate" xsi:type="CharTerm" TERMINATOR="," />
  <FIELD ID="PhoneNumber" xsi:type="CharTerm" TERMINATOR="\r\n" />
  </RECORD>
  <ROW>
  <COLUMN SOURCE="FirstName" NAME="1" xsi:type="SQLVARYCHAR" />
  <COLUMN SOURCE="LastName" NAME="2" xsi:type="SQLVARYCHAR" />
  <COLUMN SOURCE="PhoneNumber" NAME="3" xsi:type="SQLVARYCHAR" />
  </ROW>
</BCPFORMAT>

The "RECORD" element corresponds to the source file.  Each "FIELD" element maps to a column in the CSV file.  The "ROW" element corresponds to the target table.  Each "COLUMN" element corresponds to a column in the target table.  Both FIELD and COLUMN elements are listed in the format file in the same order they appear in the source file and target table.


Each FIELD has two required attributes: "ID" and "type".  The ID attribute uniquely identifies each FIELD element and the type specifies the data type contained in the column.  The ID value is arbitrary; it does not need to match the column name specified in the source file.  In this case I have substituted "LastName" in the format file for "Surname" in the source file and so on.  I have also included the "TERMINATOR" attribute to specify that columns are delimited by commas, except for the final column in each row which is terminated by a Windows new line.


Each COLUMN has two required attributes: "SOURCE" and "NAME".  The SOURCE value matches the ID value of the FIELD that the SOURCE row corresponds to.  In this case the FirstName column appears first in the target table so it is listed first in the ROW element.  The NAME value is an arbitrary value to uniquely identify each COLUMN element.  The "type" attribute is optional and specifies the data type for the column in the target table.  All columns in the target table are of type VARCHAR so the format file type of SQLVARYCHAR is used.  Because the target table does not contain a column to store the date of birth there is no COLUMN that corresponds to the BirthDate FIELD.


Here is the T-SQL statement used to execute a bulk import using an XML format file.

BULK INSERT Contacts 
FROM 'C:\sampledata\source.csv' 
WITH  
(
  FORMATFILE = 'C:\sampledata\contacts_format_file.xml',
  FIRSTROW = 2
);

The FORMATFILE value is the path to the XML format file and the FIRSTROW value of 2 instructs the server to ignore the first record in the source file which contains our column headings.


Below is a screen shot of the contents of the "Contacts" table after the import.


The format file I used in this post was in XML format, you can also use the bcp (Bulk Copy Program) utility included with SQL Server to create a plain text format file.

References

Schema Syntax for XML Format Files from SQL Server Books Online:

The BULK INSERT statement from SQL Server 2008 Books Online: 

The sample data was created using Benjamin Keen's online data generator:

Download the sample CSV file:

Download the XML format file:

No comments:

Post a Comment