Generating .do file templates
SurveyCTO can automatically generate Stata .do files that import, merge, and label your data, using your form definition as a guide. They can help you to get quickly up-and-running with Stata, by serving as a starting-point for your back-end processing code.
To download a Stata .do file template for one of your forms:
- Go to the Design tab of your server console, scroll down to the proper form in the Your forms and datasets section, then select Download and Stata .do template.
- Next, select the format in which you export your data: whether it's in long or wide format, and whether it includes group names in the column headers. (See the help topic on data export formats to learn more.)
- If your form has multiple languages, choose which one to use for labeling your fields in Stata.
- Enter the full path to the directory into which you export your CSV data. This will go into your .do file so that it knows where to find the .csv data to import.
- Enter the full path to the directory in which you store your Stata data files. This will go into your .do file so that it knows where to output Stata .dta files.
- Click to download your .do file template.
For most forms, SurveyCTO will give you a single .do file with a name like import_FORMID.do (where your form's unique ID is used instead of "FORMID"). If you selected the "long" export format, however, and your form has repeat groups, then additional .do files will also be output and automatically called from the main import_FORMID.do file; all of these .do files will be packaged together into a .zip file that you will need to unzip before using.
Try exporting your data and then running the import_FORMID.do file, to see if it works okay. If you see any errors, check out the troubleshooting section below. And see the additional section below that to learn more about the template. It's only meant to get you started on your back-end data-processing code, so feel free to revise, extend, and delete whichever parts you don't need or want.
If you run into Stata errors (in red) or other problems when running one of the auto-generated templates, the issue may be one of these common ones:
- Is the import directory correct? Open your .do file and find the local "csvfile" macro defined in the initialize form-specific parameters section near the top. That defines the full path to the .csv file to import. Make sure that it matches the location into which you exported your .csv data. If not, either fix it or re-export into the correct location, then try again.
- Is the import filename correct? Look again at the local "csvfile" macro defined near the top of your .do file. Is the .csv filename correct? Is your latest export saved with exactly that name? If not, either rename the export file or fix your Stata .do file, then try again. If you're exporting data from your server console and saving it with your web browser, the browser might add a suffix like "(1)" or "(2)" to the file, to keep from overwriting earlier versions; you'll want to either export with SurveyCTO Desktop or be very careful to fix the save-as filename every time you export, to overwrite any old versions of your data.
- Is the output directory correct? Open your .do file and find the local "dtafile" macro defined near the top of your .do file. That defines the full path to the .dta output file to use. Make sure that the directory specified in that path already exists, and that you have write permissions to it.
- Is there a problem parsing dates? SurveyCTO tries to detect the region of the computer generating the .do file and uses that to determine whether dates in the data are in MDY or DMY format, then outputs calls to clock() and date() accordingly. If SurveyCTO got it wrong – or you're processing data exported on a different computer – then you may need to search for "date" and "clock" in the .do file and adjust the MDY or DMY parameter to match the regional settings of the exporting computer. (When all else fails, use a text editor to open the raw .csv file you're trying to process, and see what format the dates are in. Then make sure the date() and clock() calls specify the correct format.)
- Is there a problem with over-long variable names? Stata only allows variable names up to 32 characters long. If you have very long group and/or field names, the first 32 characters could fail to uniquely identify a variable. If you run into this problem, you're using SurveyCTO Desktop to export your data, and group names are included in your .csv column headers, your best bet is to disable that option and no longer include group names (see Data export options), then re-generate your Stata templates and re-export your data; that will shorten many field names because they will no longer include enclosing group names.
- Is there a problem running repeat-group .do files? If your data is in "long" format and you get a Stata error like "file import_formid-repeatname.do not found", the issue is with the "do" commands at the very bottom of the main import_FORMID.do file: Stata is not able to find one or more of the repeat-group-specific .do files that were included in the original template .zip file. Many versions of Stata will automatically find these files in the same directory as the main import_FORMID.do file, but you may need to either "cd" to the proper working directory before the "do" commands or add explicit path prefixes onto the filenames, so that Stata knows precisely where to find each file.
Understanding .do file templates
Even if you do not plan to customize them much, you should familiarize yourself with the Stata code contained in these templates. Most broadly, each template does the following:
- Imports, labels, and formats all incoming data.
- Merges with any previously-imported data, dropping any duplicates. (By default, previously-imported data is respected and not overwritten, but see this help topic about overriding that behavior if you allow un-approving data in your review and correction workflow.)
- Saves the revised Stata dataset.
- Applies a local list of data corrections, if any. Your best bet is to use SurveyCTO's built-in review and correction workflow to safely apply corrections to incoming data, but SurveyCTO's Stata templates still include legacy code to support corrections from a local .csv file. See further below for more details.
- If using long format, organizes data for repeat groups (if any) into separate .dta files (linkable via the key and parent_key variables).
In more detail, each template:
- Initializes Stata ("clear all", "set mem", etc.). Depending on your memory requirements and version of Stata, you may need to revise this code.
- Initializes filenames and locations in local macros. If you later want to change your .csv or Stata directories, you can update these macro definitions.
- Lists any names of text, note, date, and date-time fields in local macros. The fields are listed with the names as they will come into Stata, from the exported .csv headers. SurveyCTO tries to make these lists as accurate as possible, based on the form definitions. However, it is possible that you might need to tweak them.
- Imports the primary incoming .csv file.
- Drops any note fields, since they do not contain data.
- Converts any date and date-time variables from text format into Stata's internal date/time format. That way, they sort and filter properly. Please note that the default code uses the clock() and date() functions to parse incoming dates, and we automatically assume MDY or DMY date ordering based on the regional settings of the computer outputting the template. If your computer's regional settings are different from the computer that exported the .do template, you may need to search for "date" and "clock" in the .do file and adjust the MDY or DMY to match your computer's regional settings. (When all else fails, use a text editor to open the raw .csv file you're trying to process and see what format the dates are in. Then make sure the date() and clock() calls specify the correct format.)
- Converts all text fields to text format. By default, this includes "calculate" fields – but you can destring them later if you want, or remove them from the text_fields macro to not convert them at all.
- Labels variables and select_one values. Note that Stata can only label numeric values, so the template will only label select_one values when all possible values are numeric. Note also that variable labels are truncated at 79 characters, but the template also adds a "note" to each field with the full text of the label (as found in the form definition).
- Merges with any previously-imported data, dropping any duplicates. Because the Stata process is designed to run repeatedly – each time importing .csv files that likely contain both old and new data – the import process defaults to never overwriting existing data with incoming .csv data; that way, you can always update or extend existing data in the Stata file without fear of it being overwritten, and the import process will only add new data to the existing data file. However, if you use a review and correction workflow and allow un-approving data, then the default behavior will mean that you potentially miss changes in data that happen after submissions have been approved. To re-import data that was previously imported – and catch potential corrections made after the initial approval – change the overwrite_old_data local macro at the top of the template from 0 to 1. See this help topic for more on advanced correction workflows.
- Saves the updated data file.
- Outputs the codebook and all variable notes to the Stata console.
Applies corrections from a local .csv file, if any.
Your best bet is to use SurveyCTO's review and correction workflow to manage corrections to your data, but you can also make use of this legacy facility to apply a local list of corrections to your data. To do so, create a .csv file with the path and filename specified in your .do file's "corrfile" local macro, and label the first four columns of the first row with these names exactly: key, fieldname, value, notes. Then, add one additional row per correction you would like applied to the data.
If you want to use Microsoft Excel to maintain the list of corrections, you should maintain a .xls or .xlsx file in which all cells are set as Text format; otherwise, Excel will do funny things like assume that door number "4/16" is April 16 and encode the value as a very long number instead of "4/16". If you maintain your corrections in .xls or .xlsx format, you can either "Save as" .csv format for the Stata template, or you can update the Stata template to import your .xls or .xlsx file directly.
If there are any corrections in the corrections .csv file, they will be applied in sequence, row by row. Each row should indicate the key of the row to correct, the name of the field to correct (as it appears in the header of .csv exports), the corrected value, and any notes you might wish to maintain in the corrections file (optional).
If there is any error applying a correction, an error message will be output to the Stata console, including the row number of the offending correction. For example, if you entered a correction with a value of "John" for a numeric field, or if you had a typo in a field name, there would be an error applying the correction.
Note that the Stata code for applying corrections is tricky. In essence, each correction is output as Stata code, into a temporary .do file, and the template executes the temporary .do file in order to apply the corrections. Thus, you have auto-generated Stata code outputting and running its own auto-generated Stata code.
- If using long format, runs additional .do files to process any secondary .csv files for repeat-group data, essentially applying all of the above logic to each repeat group (saving each as its own .dta file).
These auto-generated Stata templates are mostly meant to get you started on your back-end data-processing. Feel free to revise, extend, and delete whichever parts you don't need or want.