Coding With Fun
Home Docker Django Node.js Articles Python pip guide FAQ Policy

SAS merges datasets


May 27, 2021 SAS


Table of contents


Multiple SAS datasets can be combined based on specific public variables to give a single dataset. T his is done using the MERGE statement and the BY statement. T he total number of observations in the merged dataset is usually smaller than the sum of the observations in the original dataset. T his is because when the values of the public variables match, the variables form two data sets that are merged into one record.

Here are two prerequisites for merging datasets:

  • The input dataset must have at least one common variable to merge.
  • The input dataset must be sorted by the public variables that will be used for consolidation.

Grammar

The basic syntax of THEGE and BY statements in SAS is:

MERGE Data-Set 1 Data-Set 2
BY Common Variable

Here is a description of the parameters used:

  • Data-set1, Data-set2 is the name of the dataset that is written one by one.
  • Public variables are variables that will be merged based on a collection of data based on their matching values.

An example of data consolidation

Consider two SAS datasets, one containing an employee ID with a name and salary, and the other containing an employee ID with an employee ID and department. I n this case, in order to get complete information about each employee, we can combine the two datasets. T he final dataset will still have an observation for each employee, but it will contain salary and department variables.

# Data set 1	
ID NAME SALARY	
1 Rick 623.3		 
2 Dan 515.2 		
3 Mike 611.5 		
4 Ryan 729.1 
5 Gary 843.25 
6 Tusar 578.6 
7 Pranab 632.8 
8 Rasmi 722.5 

# Data set 2
ID DEPT
1 IT 
2 OPS
3 IT 
4 HR 
5 FIN 
6 IT 
7 OPS
8 FIN 

# Merged data set
ID NAME SALARY DEPT	
1 Rick 623.3	IT 		 
2 Dan 515.2 	OPS	
3 Mike 611.5 	IT 	
4 Ryan 729.1    HR 
5 Gary 843.25   FIN 
6 Tusar 578.6   IT 
7 Pranab 632.8  OPS
8 Rasmi 722.5   FIN 

The above results are implemented by using the following code, where public variables (IDs) are used in BY statements. N ote that observations in both datasets are sorted in the ID column.

DATA SALARY; 
  INPUT empid name $ salary  ; 
DATALINES; 
1 Rick 623.3		 
2 Dan 515.2 		
3 Mike 611.5 		
4 Ryan 729.1 
5 Gary 843.25 
6 Tusar 578.6 
7 Pranab 632.8 
8 Rasmi 722.5 
;
RUN; 
DATA DEPT; 
  INPUT empid dEPT $ ; 
DATALINES; 
1 IT 
2 OPS
3 IT 
4 HR 
5 FIN 
6 IT 
7 OPS
8 FIN 
;
RUN; 
DATA All_details;
MERGE SALARY DEPT;
BY (empid);
RUN;
PROC PRINT DATA=All_details; 
RUN;  

The value is missing from the matching column

There may be cases where some values of public variables do not match between datasets. I n this case, the dataset is still merged, but the missing value is given in the result.

Cases

Consider the lack of employee ID:3 in the dataset salary, and the lack of form dataset DEPT in the employee ID:6. W hen the above code is applied, the following results are obtained.

ID NAME SALARY DEPT	
1 Rick 623.3	IT 		 
2 Dan 515.2 	OPS	
3 .		.		IT
4 Ryan 729.1    HR 
5 Gary 843.25   FIN 
6 Tusar 578.6   .
7 Pranab 632.8  OPS
8 Rasmi 722.5   FIN 

Only matches are merged

To avoid missing values in the results, we can consider keeping only observations with matching values for public variables. T his is done by using an IN statement. T he TARGET statement of the SAS program needs to be changed.

Cases

In the following example, the IN-value retains only the observations that match the values from the dataset SALARY and DEPT.

DATA All_details;
MERGE SALARY(IN=a) DEPT(IN=b);
BY (empid);
IF a=1 and b=1;
RUN;
PROC PRINT DATA=All_details; 
RUN;  

When we execute the SAS program with the above-mentioned changes, we get the following output.

1 Rick 623.3	IT 		 
2 Dan 515.2 	OPS	
4 Ryan 729.1    HR 
5 Gary 843.25   FIN 
7 Pranab 632.8  OPS
8 Rasmi 722.5   FIN