SAS Dataset Performance Considerations
This summarized write up addresses the roles of SAS option BUFNO and BUFSIZE specifically. The key is knowing your data and data density. Data density plays key role within PROC SQL use.
- Important practical point: BUFSIZE= is mainly a creation-time decision. If a data set already exists, changing
options bufsize=...;will not rewrite its existing page size. Changing page size requires copying/recreating the data set with the new setting. - You may use more buffers while reading data. Find your default BUFSIZE value by running:
proc options option=bufsize value; run; data work.claims_clean; set stage.big_claims(bufno=32); run; - In a shared SAS environment, it is advisable not to set large setting globally. This will require some tweaks in code and trying different combinations.
- Use targeted data set options on the large steps that matter. This gives you control without forcing every SAS job on the server to use the same memory-heavy settings.
options fullstimer source2; data mart.final_claims(bufsize=64k); merge stage.claims_sorted(in=a bufno=32) stage.members_sorted(in=b bufno=32); by member_id; if a; run; - Use PROC CONTENTS to find data pagesize and then perform some limited tweaks accordingly.
- You can also summarize large data set metadata from dictionary tables:
proc sql; title "Large SAS Data Set Characteristics"; select libname, memname, nobs format=comma18., obslen format=comma12., bufsize format=comma12., npage format=comma18., filesize format=comma18. from dictionary.tables where libname = "STAGE" and memtype = "DATA" order by filesize desc; quit; - Then test a small number of combinations:
options fullstimer msglevel=i; /* Baseline: SAS defaults */ data work.test_default; set stage.big_claims; run; /* Moderate buffer count */ data work.test_bufno16; set stage.big_claims(bufno=16); run; /* Higher buffer count */ data work.test_bufno32; set stage.big_claims(bufno=32); run; /* Recreate with larger page size */ data work.test_bufsize64k(bufsize=64k); set stage.big_claims; run; /* Recreate with larger page size and read with more buffers */ data work.test_bufsize64k_bufno32(bufsize=64k); set stage.big_claims(bufno=32); run; - Compare the SAS log for:
- real time
- cpu time
- memory
- OS memory
- page faults
- physical I/O
The winning setting is not the one with the biggest BUFNO or BUFSIZE; it is the one with lower elapsed time and acceptable memory use.
options fullstimer msglevel=i;
proc options option=(bufno bufsize memsize sortsize threads cpucount compress);
run;
proc contents data=stage.big_claims;
run;When These Options Help Most
BUFNO= and BUFSIZE= are most likely to help when:
| Scenario | Likely benefit |
|---|---|
| Repeated full-table reads of large sas7bdat files | Moderate to good |
| Large sequential DATA step transformations | Moderate |
| Large merges of sorted SAS data sets | Moderate |
| Repeated access to the same smaller lookup table | Good, but consider SASFILE or hash lookup |
| PROC SQL aggregation over very large tables | Sometimes, but WORK/SORTSIZE often matter more |
| Importing original XLSX files | Limited |
| Reading CSV/TXT before creating SAS data set | Limited for input parsing, useful for output SAS data set |
| Random/index-based lookup | Needs separate testing; default page size is generally optimized for sequential access |
Efficient PROC SQL for Large SAS Datasets (250-500GB)
With datasets this size on a shared single-server, the goal is minimizing I/O passes and avoiding accidental Cartesian-style overhead. A few proven techniques:
- Avoid PROC SQL for simple joins when DATA step merge suffices. PROC SQL builds internal indexes/hash structures and can be memory-hungry; a sorted DATA step MERGE is often faster and more predictable on multi-hundred-GB tables.
- When using PROC SQL, always filter before joining. Use subqueries to pre-filter each table, for example:
proc sql; create table work.result as select a.*, b.value from (select * from big1 where date >= '01JAN2026'd) a inner join (select * from big2 where status='A') b on a.id = b.id; quit; - Index key columns used in WHERE/JOIN via PROC DATASETS (INDEX CREATE) — this lets SQL use index-based lookups instead of full scans.
- Use inner join over where-style implicit joins — explicit joins let the optimizer choose better access paths.
- Bump SORTSIZE beyond your current setting if you have enough RAM on the processing server and by running code below. In most cases the values are what was set out of box after an install — joins often trigger internal sorts; more sortsize avoids utility-file spills to disk.
proc options option=sort value; run; - Avoid SELECT * — pull only needed columns to reduce memory footprint and page I/O, especially for wide tables.
- Split CSV-to-SAS conversion into its own step with LRECL/compression tuned — don't let conversion contend with concurrent SQL sessions.
- Use COMPRESS=BINARY on output tables to cut I/O substantially for numeric-heavy wide data.
- After source data is read in process entirety of your code using WORK. The last steps to place final results could save to your permanent data output location.
- Know your data densityto provide appropriate WHERE predicates to do away with any unnecessary steps of merge/sorts to do it ahead of time; however, see suggestion #s 3 & 4 above.
- Use _METHOD and _TREE optionsto see PROC SQL's execution plan:This shows whether SQL chose a sort-merge, index, or sequential join — critical for diagnosing slow queries on huge tables.
proc sql _method _tree; select ... quit; - Pre-sort large tables once, then reuse. If the same table is joined repeatedly on the same key, sort it once (PROC SORT with TAGSORT for wide data) rather than letting PROC SQL re-sort internally each run.
- TAGSORT — for wide tables with many non-key columns, sorts only key+pointer, dramatically reducing utility file I/O.
- Use PROC APPEND instead of SQL INSERT INTO ... SELECT when concatenating large tables — it's far lighter than SQL processing.
- Partition/split tables by a logical key (e.g., year, region) into separate physical datasets. Many "500GB" problems shrink to 20GB-per-partition problems with WHERE pruning at the dataset level instead of row level.
Example: Partitioning a 500GB Table by Year
Problem: One giant claims.sas7bdat (500GB) covering 2015–2026. Most queries only need 1-2 years, but PROC SQL still scans the whole file because the WHERE year=2026 filter is a row-level filter applied after SAS opens and reads the dataset.
Step 1 — Split once during/after ETL, not every run:
%macro split_by_year(yr);
data work.claims_&yr.;
set raw.claims;
where year = &yr.;
run;
proc copy in=work out=perm;
select claims_&yr.;
run;
%mend;
%split_by_year(2024)
%split_by_year(2025)
%split_by_year(2026)Note: On the set raw.claims; statement you may use KEEP in the SET statement to only bring relevant variables, and tweak bufno.
Result: perm.claims_2024, perm.claims_2025, perm.claims_2026 — each ~20-40GB instead of one 500GB file.
Step 2 — Query only what's needed (dataset-level pruning):
proc sql;
create table work.result as
select * from perm.claims_2026
where status = 'A';
quit;SAS never touches 2015–2025 data at all — no I/O, no scan, no wasted buffers.
Step 3 — Multi-year query? Union only the partitions you need:
proc sql;
create table work.result as
select * from perm.claims_2025
union all
select * from perm.claims_2026;
quit;Also: Combine with a macro-driven %sysfunc(exist()) check to dynamically build the UNION list based on a date-range parameter, this avoids manual partition name manipulation or setup.
This trades one-time split cost for every subsequent query being dramatically cheaper — usually worth it if the table is queried more than a handful of times.