Mn/Model Final Report Chapter 4: GIS Design

Quick Links

Chapters

Appendices

Chapter 4

GIS Design

By Elizabeth Hobbs

Statewide Survey Impelmentation Model Map

Chapter 4 Table of Contents
4.1 Introduction
4.2 GIS Components
      4.2.1 Hardware and Software
      4.2.2 GIS Data
      4.2.3 GIS Procedures
      4.2.4 GIS Staff
4.3 GIS Data Standards
      4.3.1 Geographic Coordinates and Projections
      4.3.2 Map Units
      4.3.3 Grid Resolution
      4.3.4 Metadata
      4.3.5 Database Standards
4.4 GIS Data Sources
4.5 Data Quality
      4.5.1 Components of Data Quality
      4.5.2 Consequences of Data Quality
      4.5.3 Data Confidence Model
4.6 Regionalization
      4.6.1 Phase 1 Regionalization
      4.6.2 Phase 2 Regionalization
      4.6.3 Phase 3 Regionalization
      4.6.4 Potential for Regionalization Based on Past Environments
4.7 Procedural Design
      4.7.1 GIS Data Development
      4.7.2 Operationalizing Variables
      4.7.3 Modeling Procedures
4.8 Conclusions
      References

4.1 INTRODUCTION

A Geographic Information System (GIS) consists of the computer hardware, software, data, procedures, and staff required to capture, store, update, manipulate, analyze, and display geographically referenced information. An introduction to the subject can be found in Star and Estes (1990) and DeMers (1997). Maguire et al. (1991) provide more technical detail. To work effectively, all components of a project’s GIS must be selected to best meet the task at hand. For all of the components to work together to produce accurate results, project standards must be developed and followed. Data quality must be evaluated and documented. Appropriate data conversion, analysis, and quality control procedures must be established and followed. Finally, standards for presentation of data and results must be considered. This chapter reports on these aspects of the design of the GIS for Mn/Model.

Return to Top

4.2 GIS COMPONENTS

4.2.1 Hardware and Software

Both hardware and software selection are important considerations in a project of this magnitude. GIS software determines ease of use, what functions are available, to what extent procedures can be automated, the efficiency of complex processes, what data formats can be converted, the final format of the data and models, and whether software licenses can be shared. Hardware and operating systems must be capable of reliably running the selected software, allow multi-tasking, support networking and the sharing of system resources, provide sufficient file storage, and have high file transfer, processing, and graphics speeds for efficiency. All hardware and software components must be compatible and well integrated.

The size of the final Mn/Model database was a primary consideration in selecting computer hardware. Assembling a large number of statewide GIS databases required a very large amount of available hard drive capacity. Processing and analyzing large databases efficiently requires fast processors and adequate RAM (Random Access Memory). Multitasking, multiple user access, and system stability were additional considerations. Given the available technology at the beginning of the project, UNIX workstations were determined to provide the best platform. UNIX workstations used over the course of the project included a Sun Sparcstation running Solaris 2.3, Sun Ultra and Ultra2 workstations running Sun OS 5.5.1 operating system, and a Silicon Graphics Indigo running IRIX 5.3 operating system. Necessary peripherals included three tape drives, four CD-ROM (Compact Disc-Read-Only Memory) readers. In addition, 32 to 52 GB of hard drive space were available to the project. A Hewlett Packard DesignJet 755CM plotter was used for map production.

The UNIX workstations were networked, using PC-NFS (Personal Computer -Network File System) software from Intergraph Software Solution, to Intel-based desktop computers for the GIS Staff. All desktop computers were Pentium 90s or better running Windows 3.51, Windows 95, Windows NT 3.5.1 or Windows NT 4.0. The network allowed staff to use UNIX software from their desks and to view and analyze data stored on the UNIX hard drives using software running on their desktop computers.

Most GIS processing and analysis tasks were performed on the UNIX workstations. The software selected for this platform was ARC/INFO v. 7.03 (Environmental Systems Research Institute [ESRI]) and its extension, ARC/INFO GRID. The raster-based modeling capabilities provided by GRID were the primary consideration in software selection. Other considerations were its ability to handle large datasets and provision of a wide range of GIS functionality. Moreover, most of the data available for the project were already in ARC/INFO format and ARC/INFO was already used at MnDOT.

The UNIX version of ArcView v. 2.1 (also from ESRI) was used for some quality control, data display, and database work. Statistical analysis and selection of model variables were accomplished in S-Plus software running in UNIX. S-Plus is a product of Mathsoft, Inc. This statistical software had the advantage of being both very powerful and programmable.

For efficiency, some tasks were performed primarily on the desktop computers. ArcView software (Windows version 3.0) was used extensively for quality control, data display, and database work. All map compositions were created using ArcView 3.0. Some analysis was done in the Spatial Analyst extension of ArcView 3.0 running in Windows NT. AutoCAD (CAD stands for computer-aided design), a product of Autodesk, was used for digitizing. ArcCAD, an ESRI product, was used to convert from AutoCAD drawings to GIS coverages. Occasional use was made of EPPL7 (Environmental Planning and Programming Language) software, which was developed by the Minnesota Land Management Information Center. EPPL7 was used only when source data were in EPPL7 format and had to be converted to GRID format. In Phase 2, correlation coefficients were generated by SPSS (Statistical Product and Service Solutions) for Windows, release 6.13. This task was performed by S-Plus software in Phase 3.

In May, 1997, MnDOT adopted all of the ESRI GIS software products mentioned above as part of its GIS Technology Standards (The P118 Project Team 1997).

4.2.2 GIS Data

Developing a statewide archaeological predictive model requires designing and assembling a large high-resolution GIS database, extracting the most useful information from the data at hand, analyzing relationships between these data, then mapping the results of the analyses. The final results can be no better than the original data. Consequently, identifying and acquiring the best available data to meet the project’s needs is a critical task.

GIS data come in two general data structures, vector and raster. Vector data represent geographic features as points (i.e. archaeological site centroids), lines (i.e. streams), and polygons (i.e. lakes). In ARC/INFO, and throughout this report, vector layers are referred to as coverages. This data structure is best suited for display of features that are discrete, with well-defined edges. Each individual feature may be considered a separate object (Maguire et al. 1991: 200). However, vector structure has several shortcomings that make it unsuitable for predictive modeling. Vector data cannot be used to accurately represent features that change continuously across space, such as elevation or probabilities of finding archaeological sites. They do not lend themselves easily to statistical analysis, as features are usually not of a uniform size. Finally, vector layers require a lot of storage space (relative to the same information stored in raster structure) and place complex processing demands on the system.

Raster data structure represents geographic data as values in a matrix of cells. Each cell in the matrix is the same size and shape and can contain only one numeric value per layer. In ARC/INFO, raster layers are referred to as grids. They excel at representing data that change continuously across a surface (i.e. elevation, distance to water). Because cells are of uniform size, contain numeric values, and each cell is treated as a separate object for analysis, mathematical equations are readily applied to one or more grids at a time for analysis and modeling. Because of their much simpler data structure, raster layers take up less storage space and process more quickly than vector data. Consequently, raster data are well suited for modeling surfaces. They do, however, have their drawbacks (Maguire 1991: 199). Converting data from vector to raster results in a loss of locational precision and generalization of the shapes of discrete objects. It distorts the measurements of areas and perimeters, and it loses information about connectivity. Moreover, raster data is not necessarily the best solution for all kinds of modeling. However, for most purposes of Mn/Model, raster data structure was determined to be the most suitable.

Mn/Model required the development and analysis of a number of high-resolution, statewide natural and cultural resource databases acquired from several sources (Section 4.4). The primary data came in a number of different formats, including paper, vector, raster, and as digital databases. For all of these data sets to work together, they had to be converted to a common standard (Section 4.3). Procedures were developed to convert these data to a suitable consistent format and integrate the data layers so they could be used together (Section 4.2.3).

4.2.3 GIS Procedures

Standardized procedures ensure that data standards are met, quality is maintained, and products are consistent across the entire geographic extent of the project. They maintain quality control and eliminate errors while entering and extracting data from the GIS, integrating separate datasets, deriving variables, performing statistical analyses, building models, evaluating model performance, and creating map compositions and other graphics. In Mn/Model, standards and procedures provided consistency across the entire state. Because there are no data processing guidelines available from MnDOT, this project developed its own procedures.

A Standards and Procedures Manual was maintained from the beginning of the project. As the project grew more complex, the manual evolved into separate documents that were continually updated. These documents were reviewed when the first models for the Phase 1 regions were completed and before the Phase 2 regions were assembled. This was done in conjunction with detailed quality control on the data layers already produced. Any problems found in the data layers were tracked back through the documented procedures, corrected if necessary, and the revised documentation was organized into the format for the final report and technical manual. Data layers and procedures received additional review and revision at the beginning of Phase 3. The Phase 3 procedures are documented in Chapters 4, 6, 7, 8, and Appendix B. Phase 1 and 2 procedures are documented in this volume only when they remained unchanged throughout the project or if a description is necessary for comparison to later procedures that replaced them.

4.2.4 GIS Staff

The size of the GIS staff working on Mn/Model varied from two, at the beginning of the pilot phase, to 13 at the height of Phase 3. Most of the time about half of these were full-time and half part-time on this project. A wide range of experience and skills was used. It was important that higher-level staff had previous experience in a related research field as well as broad expertise in GIS. Ideally, this experience should include statistical analysis of data. Strong educational backgrounds in physical geography and natural resources were particularly helpful and complemented the knowledge of the archaeologists and geomorphologists on the team. Technical skills that contributed most to the project were experience in many aspects of the selected software, previous experience with both vector and raster data, expertise in database design and manipulation, GIS programming for automating procedures, and cartographic design for presenting the results. Organizational abilities and attention to detail are other important qualities for working successfully on such a complex project. Finally, the ability to communicate clearly is critical. Staff must be able to document procedures or revisions to procedures and to train other staff in these procedures. Several good writers and editors are necessary to complete the final project documentation.

Project tasks were assigned on the basis of staff experience and availability. The Principal Investigator for GIS, who had a Ph.D. in geography, relevant research experience, and several years experience as a GIS consultant, provided

staff coordination and direction,
acted as a liaison to the project team, project advisors, and GIS Technical Committee,
developed the research design and detailed procedures,
evaluated results and revised the research plan accordingly,
documented the project,
and provided public outreach in the form of presentations and newsletter articles.

A GIS consultant with a Ph.D. in geography and considerable research experience

acted as a liaison to the project geomorphologists,
contributed suggestions to the development of their classification scheme,
provided them with requested data and analyses,
assisted with the integration of their data into the GIS,
and performed the analysis of the paleoclimate data (Chapter 6).

Several GIS consultants with advanced degrees or comparable previous experience

reviewed, revised, developed, and automated procedures,
were responsible for more complex data conversion and integration tasks,
derived variables,
performed statistical analyses,
built and evaluated models,
performed quality control,
participated in public outreach,
assisted in project documentation,
and designed map templates and legends.

A senior technician with many years experience

supervised and assisted interns and entry level consultants,
performed quality control,
refined and developed conversion and archiving procedures,
integrated data,
and designed map composition templates and legends.

GIS consultants with college degrees in geography and up to two years experience

digitized and built coverages,
participated in other data conversion tasks,
performed quality control,
performed data integration tasks,
archived and transmitted data and models,
contributed to project documentation,
created map compositions using pre-designed templates and legends.

Student interns performed the majority of the digitizing, assisted with other data conversion, and created map compositions using pre-designed templates and legends.

Return to Top

4.3 GIS DATA STANDARDS

GIS data standards are important for developing and maintaining quality products that can be cooperatively shared with other agencies and users. Standards insure that databases are consistent from one part of the state to the other, and consequently the results of analyses are consistent and interpretable as well. Standards also promote efficiency in data conversion, processing, and documentation and provide for a quality product.

Efforts are underway at both the state (Cialek 1993; Minnesota Governor’s Council on Geographic Information GIS Standards Committee 1998) and federal (Federal Geographic Data Committee 1997) levels to develop standards. These are intended to both promote GIS data quality and to facilitate the integration of data from a wide range of sources into statewide, regional, or national GIS databases. However, these initiatives are relatively young and few standards have been adopted yet. Without guidance from federal or state guidelines, Mn/Model project standards were developed on the basis of good GIS practice, project needs, and adherence guidelines provided by MnDOT.

4.3.1 Geographic Coordinates and Projections

Geographic coordinates register the GIS data to a real place on the face of the earth (Bonham-Carter 1994: 87-88; Maling 1991: 135-139; Star and Estes 1990: 98-99). The most widely known of these systems is latitude/longitude. Map projections control the amount and kind of distortion that occurs when the earth’s curved surface is represented in only two dimensions (Bonham-Carter 1994: 88-95; Maling 1991: 139-146; Snyder and Voxland 1989; Star and Estes 1990: 99-101). Data from different sources must be in the same coordinate system and projection to be integrated either vertically (overlaid on one another) or horizontally (appended to form a larger geographic area). Consequently, it is extremely important to know the coordinate systems and projections of the source data, whether digital or hard copy.

MnDOT has not formally adopted coordinate standards. They have developed some guidelines. These include:

Universal Transverse Mercator (UTM) zone 15 extended as the coordinate system/map projection for statewide data or data transfer. Most of Minnesota is in UTM zone 15. So that data can be maintained and displayed for the entire state, zone 15 is extended to include the portions of the state in zones 14 and 16. These areas are along the western edge of the state and in the northeast corner. Some distortion of these areas results from their inclusion in zone 15.
NAD83 datum. A geodetic datum is a model of the earth’s shape. Models may take a variety of forms, but the most realistic is the spheroid, a form of ellipsoid (Bonham-Carter 1994: 89, 360). Different datums may show the same point on the surface of the earth at slightly different coordinates, even if both are in the same coordinate system. NAD83 refers to the North American Datum of 1983. This is a horizontal control datum for the United States, Canada, Mexico, and Central America. It is has an earth-centered origin (as opposed to having its origin at a point on the earth’s surface) and is based on the Geodetic Reference System 1980 (GRS80) spheroid. It was defined on the basis of both terrestrial and satellite data (Bonham-Carter 1994: 368; ESRI 1992: 7; Montegomery and Schuch 1993: 268).
No coordinate shift. Coordinate shifts are used to allow more digits to the right of the decimal point to be stored in the database. This is done by subtracting a very large constant from the coordinate values. Depending on the coordinate system, the horizontal units used, the quality of the data, and the platform, this may improve the spatial accuracy of the data.

Mn/Model followed MnDOT’s guidelines, using, the Universal Transverse Mercator (UTM) projection and coordinate system, zone 15 extended, with the NAD83 datum and no coordinate shift.

4.3.2 Map Units

Horizontal map units determine both the unit of measurement for the GIS dataset and the units in which the coordinate system are expressed. The UTM projection can be expressed in feet or meters, but two UTM datasets in different map units would not overlay. Mn/Model used meters as horizontal units, as UTM meters are suggested by the MnDOT guidelines and are widely used by other Minnesota State agencies that share data.

Vertical map units express surface variations such as elevation, tree height, or building height. Elevation units for Mn/Model were selected for practical reasons and to meet project needs. Some USGS Digital Elevation Models (DEMs) had meters as vertical units and others had feet. In both cases, though data were stored as floating point numbers, the data were integer values, with only zeros to the right of the decimal. Feet were selected as the project standard because they best met project needs. Converting meters to feet and maintaining feet data in their original units allowed the conversion of the floating point DEMs to integer grids without any loss of information. This is an advantage in such a large-scale project, as integer grids take much less storage space than floating point grids.

4.3.3 Grid Resolution

The resolution of a raster dataset or grid is the distance on the ground corresponding to one cell. High resolution grids have small cells; low resolution grids have large cells. The precise distinction between high and low resolution depends on context. For remotely sensed images or elevation data, one meter cells may be considered high resolution, 30 meter cells medium, and 100 meter cells fairly low. In the archaeological predictive modeling literature, the definition of high resolution varies and the upper limit is relatively high. Kohler and Parker (1986) define high resolution as having a quadrat size of less than two hectares (1,414 meters cells). They apparently equate the size of field samples with the size of units represented in the data grid. Kvamme (1992) produced a 50-meter grid for the Pinon Canyon model, which was considered high-resolution. The 30-meter cell size used for Mn/Model is very high resolution within this context.

The standard grid resolution of 30 meters was established to meet project needs and to be appropriate for the most important data layers. MnDOT guidelines suggest using 1:24,000 scale or smaller for data used in planning applications. Consequently, the USGS 7.5 minute (1:24,000 scale) Digital Elevation Models (DEMs) were one of the essential data sources for the project. These raster data have an approximately 30.002-meter resolution in Minnesota. These were resampled to a true 30-meter resolution for this project. Many variables are derived from this source. To resample to a smaller cell size would not improve the data or analysis, and would increase both storage and processing requirements. To resample to a larger cell size would generalize, thus losing information. Moreover, this resolution is very close to the 24 m minimum mapping unit (Section 4.5.1.1 and Table 4.2) for 1:24,000 scale maps. Although few digital datasets are now available at that scale for Minnesota, more should become available in the future. In the meantime, the standard grid resolution used in Mn/Model is quite different from the actual resolution of some of the data represented. Please refer the discussion of data quality in Section 4.5 to understand the ramifications of this.

4.3.4 Metadata

Metadata refers to information about data. Metadata is an important component of any geographic data product. It conveys to the user information about the source of the data, the meaning of the data, the age of the data, the accuracy of the data, what changes have been made to the data, and so on (Bonham-Carter 1994: 87; Star and Estes 1990: 128). Without metadata, GIS databases can be incomprehensible to anyone except the data’s developer. Developing good metadata is time-consuming and has not been required by many agencies in the past. When metadata was developed voluntarily in the past, it took many different forms, as there was no accepted standard. However, with more GIS data being shared among federal, state, and local agencies, the development of standardized metadata has become a priority.

In 1993, the Federal Geographic Data Committee established The Content Standards for Geospatial Metadata. In Minnesota, the GIS Standards Committee of the Governor’s Council on Geographic Information adapted these standards to develop the Minnesota Geographic Metadata Guidelines. Version 1.0 of these guidelines was released in September 1996 (Minnesota Governor’s Council on Geographic Information GIS Standards Committee 1996). These guidelines were used for the Mn/Model project. At that time, the State had developed metadata for only one dataset that conformed to these guidelines. The Mn/Model GIS team developed the remainder of the metadata needed using what paper documentation was available and obtaining additional information through phone calls. These metadata were provided to MnDOT in ASCII format at the end of Phase 2. They were also provided to the state agencies that supplied the data.

During Phase 3, the State adopted a metadata entry software package (DataLogr). All metadata from Phase 2 were re-entered using DataLogr. At the same time they were updated with information from the state agencies that had been developing and improving their metadata. Metadata for new data obtained or developed in Phase 3 were also developed using DataLogr. All project metadata were converted to HTML format for final delivery.

4.3.5 Database Standards

Database design is a science in its own right. However, much of it is concerned with the physical design (locations of different parts of the database on a computer system) and logical design (interrelationship between different data sets) of a complex database (Healey 1991: 253). For Mn/Model, physical and logical design of the entire database developed over the course of the project and were not formalized. Informally, logical design was a consideration only for those datasets that were related to one another in the modeling process. The most important consideration in this project was the design of the individual database tables, which were mostly GIS feature attribute tables.

ARC/INFO software includes a relational database management system. ARC/INFO databases (both feature attribute tables and other related tables) are represented in a very simple data structure of rows, also called records, representing geographic features and columns (also called fields or items) representing attributes of the features. Relational databases have three principal features: the primary key, relational joins, and normal forms (Healey 1991: 257). These are important considerations for the design of each individual database.

The concept of the primary key stems from relational algebra. Within this context, a database table represents a set. Since sets cannot have duplicate values, no table can have any rows with duplicate contents in every column. Given this, there must necessarily be one column (field, item) with unique values for each row or multiple columns that together provide a unique combination of values for each row. This column or group of columns can be defined as the primary key. This primary key provides an "address" for each row (record) in the table (Healey 1991: 257-258).

In vector coverages, the primary key is usually a feature identification number that is coded in the process of converting the data. It identifies each geographic feature uniquely, though many features (such as soil polygons) may share the same values for all other attributes. For vector coverages used in Mn/Model, the primary key was maintained in the ARC/INFO attribute table field USER_ID.

For raster data, each cell contains only one value, which represents an attribute such as elevation or soil type. The value itself becomes the entity feature represented as a single row in the value attribute table (VAT) and is the primary key. The VAT also reports the number of cells containing each value, and additional attributes can be added. However, new attributes must be associated with the primary key. There is no way to assign different secondary attributes to cells having the same primary key value. However, it is possible to use the primary key, or feature identification number, from a vector coverage as the cell value for a grid created from that coverage. In this way, the same primary key is retained and additional attributes from the coverage can be joined to the grid’s VAT. However, this is only necessary when the distinctions between separate features, such as archaeological sites, must be maintained in the grid.

A relational join is a way of linking together different database tables. This is accomplished by matching the values of one column in a database to corresponding values in a column of a separate database (Healey, 1991). In ARC/INFO, for such a join to occur, the two columns must be defined in exactly the same way in every respect. The first column’s name, data type (character, integer, number), the number of decimals stored for the values, and width (how many characters or digits can be stored for each value) must exactly match those of the second column. ArcView does not share this requirement. Because most of the data processing for the project was performed in ARC/INFO, we defined and maintained common field definitions for columns that were found in more than one table.

For a successful join to occur in any relational database software, the values recorded in the two columns must match exactly. For instance, if residential land use is coded as "R-1" in one table and "R1" in another table, no match will be found. Moreover, ARC/INFO is case-sensitive, so "R-1" and "r-1" would not be matched either. At a more elementary level, it follows that values for categorical data must be standardized, both within tables and between tables. For example, within the archaeological database, middens should not appear variously as "midden," "Midden," "M," and "MD." One standard value must be decided upon to simplify query and analysis. It is best that this value be as simple as possible to reduce the possibility of typographic errors. Also, if "M" is selected as the value for middens in the main archaeological database, then references to middens in any other database should use the value "M" as well.

Preferably, the values selected for standardization should be short and simple to facilitate data entry, reduce typographical errors, and reduce data storage requirements. They must be mutually exclusive; the same column cannot use "M" for middens and "M" for multi-component sites. It is good practice to standardize characters on all upper case or all lower case. Using a combination of upper and lower case slows data entry and invites topographic errors. Moreover, some database software, including ARC/INFO, is case-sensitive. Using such software, the values "M" and "m" would not be matched in a join or selected together in a simple query. Following these rules not only allows tables to be joined, it also reduces confusion among those working with the data, reduces data entry errors, and facilitates quality control. Standard values adopted for Mn/Model GIS databases are documented in Appendix B and in the metadata.

Another consideration in designing relational databases is the theory of normal forms. This theory specifies several rules for the design of individual database tables. The intent of these rules is to reduce redundancy in the recording of data and to facilitate data input, updating, querying, and analysis. The first requirement, or first normal form, is that no column-and-row intersection (cell) may contain repeating groups of data, such as lists (Healey 1991: 258). In a record for a multi-component archaeological site, for example, there should not be one column called SITE_TYPE containing a list "midden, burial, artifact scatter." These must be broken out into different columns, the simplest form being a separate column for each site type with Y or N as the appropriate column values. The primary advantage of the first normal form is that it facilitates query and analysis. Constructing a query to reliably extract a single site type out of a list can be confusing, especially for inexperienced database users. It can also lead to errors. These difficulties are best illustrated by another example of a violation of first normal form, the use of multi-character codes to represent multiple attributes of a feature. For example, the National Wetlands Inventory (NWI) code E2EM5Nd can be broken down as follows (Santos and Gauster 1993):

E = Estuarine system
2 = Intertidal subsystem
EM = Emergent class
5 = Narrow-leaved persistent subclass
N = Regular water regime
D = Partially drained/ditched (special modifier)

This code presents multiple problems.

It invites typographical errors.
It requires constant reference to metadata to interpret the codes or to determine which codes are of interest to the study at hand.
Querying is extremely complex. For instance, to find all intertidal subsystems it is not sufficient to simply search for any string containing the number "2." This number is used for different subsystems, depending on what system they are in. Moreover, it is used again for a number of subclasses and for water chemistry modifiers. To find all intertidal subsystems one would have to find strings that contain "E2" or "M2" in the first two positions in the string, but not"EM2" in the second and third position or "E2" in the fourth and fifth position.

The appropriate solution to this problem is to design the table so that it contains the separate fields SYSTEM, SUBSYSTEM, CLASS, SUBCLASS, WATER_REGIME, WATER_CHEMISTRY, and OTHER_MODIFIERS.

The first normal form was followed for all databases created for this project. However, some databases received from other sources violated this rule. A considerable amount of time was spent extracting pertinent information from codes in the NWI database to populate new columns (Appendix B). This version of the database, as part of a coverage called MWI, resides on the MnDOT GIS server. However, it will soon be replaced by the updated version of NWI created by Mn/DNR. Mn/DNR also broke the NWI codes into separate fields and, in addition, corrected some codes. For use in Mn/Model, several attributes in the SHPO database required normalization as well.

The second normal form requires that every column must be fully dependent on the primary key (Healey 1991: 258). This rule applies to tables where there is more than one component, or column, to the primary key. For example, to uniquely identify each section in the Public Land Survey System (PLSS), the primary key would have to consist of the three columns TWN, RNG, and SEC. Any additional column in the table that was dependent on only one or two of these primary key columns (such as the year a township boundary was surveyed) would violate second normal form. In this case, there should be a second table. That table would require the columns TWN and RNG as the primary key (to uniquely identify each township in the PLSS) and a third column for the year of the boundary survey. Because the tables created for Mn/Model did not require multi-component primary keys, this was not a consideration.

The third normal form requires non-transitive dependence on the primary key (Healey 1991: 258-259). Every column that is not part of the primary key must owe its inclusion in the table to its relationship to the primary key. Columns that are included because they are related to a column that is not the primary key should be in a separate table. Consider the example of a feature attribute table associated with a trees coverage. There is a row for each individual tree. The table contains the columns ID, SPP, DIAM, and SUSC.

ID is the primary key and contains a unique integer value for each tree on the map, and consequently for each row in the database.
SPP represents the species of the tree. Each tree can be of one and only one species. However, SPP is not unique for each feature (row) in the database. There may be a large number of trees of the same species.
DIAM represents each tree’s diameter at the time the data were recorded. Each tree can have one and only one diameter. However, like species, diameter is not unique for each tree.
SUSC represents disease susceptibility. Possible values are "oak wilt" and "pine rust." These values are a function of the species. Every oak in the database will be susceptible to oak wilt and every pine to pine rust.

Because SUSC is related to SPP rather than to ID (the primary key), there should be two separate tables. The feature attribute table associated with the trees coverage should have the fields ID, SPP, and DIAM. A separate table should contain SPP and SUSC. If the distribution of susceptibility of trees to oak wilt must be mapped or analyzed, the second table can be joined to the feature attribute table using the shared field SPP. The advantage of the third normal form is that it simplifies tables and reduces the storage space required. If there are 10,000 trees in the trees coverage, the original form of the feature attribute table would require the storage of 10,000 values for susceptibility. With the third normal form, only two values for susceptibility are stored.

Although conformance with the third normal form is good database design and management practice, it is often violated in GIS data that are being actively used in analysis. This is because working with related tables in complex analytical procedures can slow down processing. Because fast and efficient processing was required throughout the project, the third normal form was not always followed. For example, the value attribute table for the high resolution soils data contains one field, which is the primary key. That field is shared with several other data tables that contain data relating to the primary key. Columns in those tables are shared with columns in additional tables that contain data relating to those attributes, but not relating directly to the primary key. In all, the soils databases contain a very large number of attributes, most of which were not relevant to this project. These tables conformed to the third normal form, but performing the analyses using multiple temporary joins would have slowed processing considerable. For efficiency, we permanently joined the attributes of interest, from various tables, to the value attribute table of the soils grid.

Return to Top

4.4 GIS DATA SOURCES

The data used for model development came from many sources (see Table 4.1, Appendix B, and the Mn/Model metadata). Both archaeological and environmental data were required to build the models. Geographic data at a scale of 1:24,000 or better were preferred. However, for some GIS layers, digital statewide coverage was available only at smaller scales (1:100,000 – 1:500,000). Some layers were available only as paper maps and were also at scales smaller than 1:24,000.

Several state agencies were important sources of digital information. Most of the archaeological data were provided by the State Historic Preservation Office (SHPO). With funding from MnDOT, they had recently converted many of their paper records to a digital database format. Site locations were recorded in this database as site centroids, using UTM coordinates measured from 1:24,000 scale topographic maps. The Chippewa and Superior National Forests provided additional archaeological data in digital database format. They also had site centroid coordinates recorded. For this project, UTM coordinates were also recorded for random points within areas surveyed, but where no sites were found. These surveyed areas were identified on paper maps in the SHPO files (see Section 5.7.3). For this project, the points were referred to as negative survey points.

Environmental data came from a variety of sources. Land Management Information Center (LMIC) serves as a clearinghouse for federal and state GIS data. It provided a variety of the layers listed below. The Minnesota Department of Natural Resources (MN DNR) provided a number of natural resources layers they had digitized. In addition, the MN DNR provided USGS DEMs that were already converted to ARC/INFO lattices. MnDOT provided Version 1.0 of its base map, which was digitized from the USGS 7.5 minute quadrangles. Soils data were acquired from several counties and from the Metropolitan Council, in addition to those acquired from LMIC. The Minnesota Geological Survey provided digital maps of bedrock geology for several counties.

Several layers were digitized from paper maps specifically for this project. These included landscape/sediment assemblages in river valleys and selected upland quadrangles, which were mapped as part of the Mn/Model geomorphological research. Trygg maps were digitized for 20 counties (Beltrami, Carlton, Carver, Cass, Chisago, Clay, Douglas, Faribault, Fillmore, Goodhue, Itasca, Kittson, Mower, Nicollet, Nobles, Pennington, Rock, Stearns, Wabasha, Wright). William Trygg mapped features from the General Land Survey plat maps at a scale of 1:250,000. These represent vegetation, roads, and settlements recorded during the establishment of the Public Land Survey System (PLSS) in the mid- to late 19th and early 20th centuries. The paper maps are available from the Trygg Land Company in Ely, Minnesota. Finally, bedrock geology and exposures were digitized for several counties in southeastern Minnesota where digital bedrock maps were not available. Bedrock geology was considered significant only in southeastern Minnesota’s unglaciated region, where outcrops of chert were important resources for making stone tools. More detailed documentation of these data sources can be found in Appendix B.

Table 4.1. GIS Data Sources.

The data provider, when not the primary source, is indicated in parentheses.

Theme	Source	Source Scale	Coverage
Archaeological Sites	State Historic Preservation Office, National Forest Service	1:24,000	Statewide
Lakes	National Wetlands Inventory (LMIC)	1:24,000	Statewide
Major Rivers	National Wetlands Inventory (LMIC)	1:24,000	Statewide
Perennial Streams	MnDOT BaseMap	1:24,000	Statewide
Intermittent Streams	MnDOT BaseMap	1:24,000	Statewide
Wetlands	National Wetlands Inventory (LMIC)	1:24,000	Statewide
Elevation	USGS 7.5 minute DEMs (MN DNR)	1:24,000	81% of state
Elevation	MGC100 (LMIC)	1:250,000	Statewide
Historical Vegetation (Marschner)	MN DNR	1:500,000	Statewide
Major and Minor Watersheds	MN DNR	1:100,000	Statewide
Geomorphic Regions	MGC100 (LMIC)	1:125,000	Statewide
Bedrock Geology	MGC100 (LMIC)	1:500,000	Statewide
Bedrock Geology	MN Geological Survey	1:100,000 or 1:250,000	Six counties (SE Riverine Region only)
Quaternary Geology	MGC100 (LMIC)	1:500,000	Statewide
Bedrock Outcrops	MGC100 (LMIC)	1:1,000,000	Statewide
Landforms	MGC100 (LMIC)	1:125,000	Statewide
Water Erosion	MGC100 (LMIC)	1:250,000	Statewide
Wind Erosion	MGC100 (LMIC)	1:250,000	Statewide
Sedimentation	MGC100 (LMIC)	1:250,000	Statewide
Tree Species Distribution	MN DNR (LMIC)	Very generalized, worse than 1:1,000,000	Statewide
County Soil Surveys	LMIC, Metropolitan Council (for county list see Table B.7)	Better than 1:24,000	24 counties
Soils	MGC100 (LMIC)	1:250,000	Statewide
Landforms	DNR	1:100,000	Statewide
Landform/Sediment Assemblages	Mn/Model	1:24,000	Eight river valleys, one glacial lake basin, and 16 upland quadrangles
Historic Cultural Features	Trygg Maps	1:250,000	Twenty counties digitized

Return to Top

4.5 DATA QUALITY

Quality refers to properties of the data, such as accuracy and resolution. The quality of the GIS data used in this project varied significantly. The known quality of each data layer has been documented in the metadata for that layer and is discussed in Chapter 6 and Appendix B. Users should be aware of any quality issues that might affect confidence in the resulting models. Model quality depends on the quality of each data layer that contributed to the model. To evaluate the overall quality of the data that contributed to each model, a separate model of data confidence was developed (Figure 4.1). This model is discussed in Section 4.5.3.

4.5.1 Components of Data Quality

Data quality is usually measured by four components: locational or spatial accuracy, attribute accuracy, consistency, and completeness (Muller 1991: 459). Source data scale or poor data collection and conversion methods are often the cause of poor data quality. Accuracy and consistency of available data usually cannot be improved without starting over and generating new data. For a project of statewide scope, this is not feasible. By the same token, project resources may not be available to complete datasets that have not yet been finished by the source agency. Little can be done to compensate for these problems except to choose not to use the data. For this project, the best data available were selected. However, even the best data were sometimes of poor quality, particularly for high resolution modeling. The components of data accuracy, their evidence in data used or considered for Mn/Model, and the consequences of using these data are discussed below.

4.5.1.1 Locational Accuracy

Locational accuracy refers to the accuracy with which geographic features are positioned in a spatial dataset (coverage or grid) and is related to the concept of resolution. With reference to data (as opposed to grids), resolution is a function of the size of the smallest object that can be mapped from the source material (Fisher 1991: 175). Often locational accuracy is expressed as the potential for variance of the feature on a map (or the coordinates in a GIS) from the true location of that feature on the ground. This variance has the same value as the resolution.

Surveying produces the best GIS data. The locations of features are measured directly on the ground, in real units, with reference to monuments that are known to be located with a high degree of accuracy. With modern survey technology, objects can usually be located within inches or centimeters of their true location on the earth. This translates to a resolution of inches or centimeters. When the coordinates of these objects are entered directly into the GIS, they are not degraded, so the accuracy of the GIS data is the same as the accuracy of the survey.

Aerial photographs and satellite imagery are common sources of mapped data. The resolution of aerial photographs is a function of many factors, including the characteristics of the object and its background, lens quality, spectral sensitivity of the film, the height of the aircraft that took the pictures, and the clarity of the atmosphere (Estes, Simonett, et al. 1975: 878-882). When aerial photographs are mapped, the map resolution may also be affected by the scale at which the image is printed (see below), or digitally displayed, and the scale of the base map to which the information is transferred. The resolution of remotely sensed data is typically expressed as the minimum size of a feature that can be readily distinguished by the remote sensing system (Star and Estes 1990: 274). This is usually a function of the instantaneous field of view of the sensor (Reeves 1975: 2087).

If data are digitized from paper maps, data resolution is generally considered to be a function of map scale, and resolution is usually defined as the width, in ground units, of a 0.5 mm (0.02 inch) pencil line on the map (Table 4.2). However, pencil line on a map represents a linear feature, not an area. To determine the appropriate grid cell size for a map, the minimum mapping unit must be considered (Fisher 1991: 176). This is the smallest area that can be drawn and is the width on the ground of two lines totaling 1 mm (0.04 inch) (Table 4.2). Creating a grid with a cell size smaller than the minimum mapping unit will not change the resolution of the data.

Because of the ratio between line width and ground distance, the locations of features can be recorded more accurately on maps at scales of 1:24,000 or better than on lower resolution maps (1:62,500 or smaller scale). To envision this, consider that the width of a pencil line on a 1:24,000 scale map is 12 meters in ground units (Table 4.2). Even if mapped with the greatest possible precision, the centerline of a road represented at this scale is considered to be +/- 12 meters from its true location on the earth’s surface. A road mapped at 1:250,000 scale should be interpreted as +/- 125 meters from its true location. This is a considerable loss of spatial accuracy.

Table 4.2. Relationship Between Map Scale and the Minimum Mapping Unit.

(Fisher 1991: 177)

Scale	Resolution	Minimum Mapping Unit
1:10,000	5 m (17 ft)	10 m (34 ft)
1:24,000	12 m (39.5 ft)	24 m (79 ft)
1:62,500	31 m (102 ft)	62 m (204 ft)
1:100,000	50 m (164 ft)	100 m (328 ft)
1:250,000	125 m (410 ft)	250 m (820 ft)
1:500,000	250 m (820 ft)	500 m (1640 ft)
1:1,000,000	500 m (1640 ft)	1000 m (3280 ft)

Attempts to represent continuous geographic data using a vector data structure, which is analogous to traditional pen and ink mapping techniques, result in generalization that reduces locational accuracy. Elevation is perhaps the easiest form of continuous data to visualize geographically. Elevation changes continuously from one point on the earth to the next, no matter how close together the sample points are. A denser network of sample points simply records more and finer variations in the shape of the land surface. To make a visually interpretable map of elevation traditionally meant interpolating lines of equal elevation (contours) between the sampled points (Strahler 1975, 596-602). This generalizes the data by losing specific information about sample points that fall between the contour intervals. With the advent of raster GIS, the sampled points (if at regular intervals) can be made into a grid, with one cell for each point. This preserves the information and allows the data to be represented as a gray scale or scale of gradually changing colors. The data can also be analyzed to create a hillshade display, which is more realistic visually than a contour map.

Another method of representing continuously changing data in vector format is to classify the information and map the data as polygons. Classification, a form of generalization, may result in the representation of spatial information with sharp boundaries between features where, in fact, boundaries are fuzzy (Muller 1991: 460). This applies to many natural phenomena, including vegetation, animal populations, and soil characteristics, such as water holding capacity and pH. In vegetation ecology, the concept that vegetation composition and structure change gradually across the landscape in response to environmental gradients has been increasingly accepted since the 1920s (McIntosh 1967; Scott, J.T. 1974 ). The implication of this theory is that there should be no sharp boundaries between different vegetation communities. Where such boundaries are recognized they are attributed to discontinuities, places where vegetation change occurs over a very short distance because of a steep environmental gradient (O’Neill et al. 1986: 89). An example would be the top of a flood plain where it abuts the bottom of a bluff. Fuzzy boundaries (ecotones) may be recognized where vegetation and environment change across a relatively short distance (measured in meters or kilometers), but not abruptly.

However, vector mapping requires that lines be drawn. This requires a classification scheme to determine which discrete entities are to be mapped. Usually categories within a classification system are based on the co-occurrence of attributes that individually exhibit continuous change. In vegetation classification, this may be the dominance and CO-occurrence of particular species of trees, such as maple, basswood, and elm, to define the vegetation class "Big Woods." In soil classification, the CO-occurrence of a number of recognizable soil factors, including color, texture, horizon development, determines the identification of a soil type. Once these entities are defined and identified in the field or on an aerial photograph, boundaries must be drawn around them.

Determining the locations of boundaries of these entities requires a set of decision rules. One such rule defines the smallest unit to be mapped. This is sometimes, but not always, the smallest feature that can be distinguished in the source material. For example, even though individual trees can be distinguished on aerial photographs, mapping decision rules may specify a larger mapping unit, such as clusters of trees of a certain minimum size. Although this practice results in a loss of information, it usually allows data to be mapped more quickly and may be better suited to a project’s objectives, such as mapping vegetation communities. In some mapping projects, decisions may be made to map data at a certain level of generalization because of time or budget constraints or data availability. The smallest area mapped in the Minnesota Soil Atlas (University of Minnesota, Department of Soil Science 1970-1976) for which reliable information was available is approximately 600 acres, even though these maps were originally rasterized by 40 acre cells (Minnesota State Planning Agency, Land Management Information Center 1989). This determines the true resolution of the generalized soils and geomorphology data used in building the models for this project, as they were derived from this source.

Scale will also be a factor in determining where boundary lines are drawn in such mapping exercises. Because of the resolution of the map base, a 500 meter wide ecotone mapped at a scale of 1:1,000,000 is in fact the width of a pencil line (Table 4.2). Had this same ecotone been mapped at a scale of 1:62,500 it would have been the width of 16 pencil lines (8 mm). A line drawn through the middle of the ecotone would be the typical solution to representing this "boundary," but that line cannot convey the "fuzzy" aspect of the boundary, which is the width of the ecotone.

Another issue is the recognition that features on the surface of the earth are not static. Maps represent the positions of features at one point in time (or within a defined time range over which the data were collected). Lake levels rise and fall dramatically within a few months time. Rivers and streams change their courses, sometimes very gradually and sometimes quite rapidly following major rainfall events. Vegetation patterns change in the short term because of human activities, fire, wind, and flooding. They change in the long term because of climate change, the opening up of new habitats, and the migration of new species into an area. Data collected last year are more likely to be similar to conditions this year than to conditions 10, 100, or especially 1,000 years ago.

4.5.1.2 Attribute Accuracy

The accuracy of attributes in the database tables can be affected by many factors. The accuracy of the source data is foremost among these. Data that have been measured are more accurate than data that have been estimated. Data measured, or estimated, using modern scientific methods are generally considered to be more accurate than data derived from older methods. Older data collected by less reliable methods, as in some older county soil surveys or archaeological surveys, have the potential to skew model results. Finally, there is a greater confidence level with data that have been field verified as opposed to data interpreted exclusively from maps, aerial photographs, or satellite imagery. During this project, the river valley landform maps were the only field-verified data available.

By the same token, data that have been generalized are less accurate than data recorded in more detail (Muller 1991: 460). For example, mean household income reported for a census tract may not accurately convey the pattern of income discrepancies within that tract. The designation of a vegetation community on a map is also a generalization, usually representing the dominant, or most extensive, vegetation type within each polygon. The degree of generalization is usually a function of the minimum mapping unit and of decision rules. For example, mapping procedures may specify that units smaller than one acre will not be mapped. The designation of a polygon as prairie on the Marschner (1974) map of historic vegetation may hide the fact that it contains small stands of trees in some environmental settings. Thus, the attribute "prairie" is not entirely accurate.

Because the characteristics of geographic features are prone to change, recent data are usually considered more accurate than older data. However, this is only the case when the application involves understanding current conditions. Because this study attempts to analyze spatial patterns that were established in the past, older data sources are still valuable when they are available. For instance, maps of vegetation at the time of the Public Land Survey, such as Marschner (1974), are more useful for this project than maps of modern vegetation cover, even though the methods used to collect the data were not scientific.

4.5.1.3 Consistency

Data collected by inconsistent methods are lower quality and less reliable than data collected by scientifically reproducible methods (Kvamme, 1988: 351). Achieving consistent results requires good research design and quality assurance. There are several principles to consider:

Probabilistically based research methods (random surveys, stratified random surveys, censuses, 100 percent surveys) can prevent bias in which locations are sampled, which people are interviewed, or which records are consulted. If stratification is used, however, the stratification rules must also be probabilistically based. Otherwise the strata themselves may introduce very strong bias into the data, as was the case in the MnSAS survey and the first Mn/Model field survey (Sections 5.4.2 and 5.4.3).
Clear decision rules help reduce personal judgment calls, making research conducted by different team members more consistent. Decision rules could include the specific information to be collected at each sample point, how to determine whether the information observed fits into one category or another, and procedures to follow if the designated sample point (location, person) cannot be accessed,.
Early and ongoing quality assurance is necessary to insure that each team member is following the prescribed procedures and interpreting them in the same way.
Detailed documentation of the research methods is necessary so that future research, in another place or at another time, can generate more data for inclusion in the same database or data for comparison to that previously collected.

There are several examples of inconsistent datasets among those used in Mn/Model. The archaeological database is based on data collected over a number of years, by many different archaeologists, using a wide variety of survey techniques and decision rules, and reporting their results with varying degrees of completeness and precision. Efforts were made to make the data more consistent by excluding the most questionable sources (Chapter 5). Ultimately, however, most of the available data were used in modeling, the implications of which are discussed in Section 4.5.2 and Chapters 7 and 8.

Elevation data, a key environmental layer, were inconsistent across the state (Figure 4.2). First, the 7.5 minute (1:24,000 scale) digital elevation models (DEMs) were not completed for the entire state at the beginning of this project. Consequently, 1:250,000 scale DEMs had to be used for parts of the state. Second, the 7.5 minute DEMs were of varying quality. One group of these, the Level 1 DEMs, was built to meet an accuracy standard set by USGS (U.S. Geological Survey 1990). Elevations were collected by the USGS and several independent contractors using various methods. One method produced distortion in the data, which resulted in a banding or striping effect when the data are viewed. This pattern was apparent in some Level 1 DEMs for Minnesota, mostly in the southeastern part of the state (Figure 4.2). Various algorithms can be used to reduce the systematic distortions (see Appendix B). However, because the distortion itself varies, no single algorithm removes it equally well from all DEMs where it occurs. By changing and standardizing methods for building DEMS, USGS eliminated this problem from the Level 2 DEMs. USGS and the State of Minnesota have now cooperatively completed Level 2 DEM coverage of the state, including replacing all Level 1 DEMs.

4.5.1.4 Completeness

GIS data that have not been completed for the area of interest or that are lacking attributes needed for analysis are poorly suited to a project. Several data layers that were considered very desirable for inclusion in Mn/Model were completed for only parts of the state at the beginning of the project. In the case of the 1:24,000 elevation data, lower quality (1:250,000 scale) data were substituted only for quadrangles that were not yet available at 1:24,000 scale. However, this introduced more inconsistency into the Mn/Model elevation data (discussed in Section 4.5.1.3).

Where it was not possible to complete coverage of the state using an alternative data source, data with incomplete coverage could not be used for statewide models (see Section 4.5.2). To assess the value of some of the incomplete data for future modeling, small area models were developed in Phase 2 using variables derived from county soil surveys and digitized Trygg (1964-1969) maps in addition to the variables used for the statewide modeling (Chapter 8).

Other components of data completeness include whether the data have been completely developed or processed to include all of the features needed for analysis and whether these features have been completely attributed. The MnDOT BaseMap includes a layer of lakes and double line rivers digitized from USGS 1:24,000 scale quadrangles. Coverage is statewide. However, the features exist only as line work (i.e. shorelines are represented). For GIS analysis, lakes and double line rivers must be represented as polygons and the polygons must contain attributes that distinguish lakes, islands, rivers, and land from one another. These polygons can be built from the existing lines using additional data processing steps. This is very time consuming, as it involved making sure that all lines are connected (in this case they were not) and adding a label point to each polygon for attaching attributes. Consequently, these data were not yet at a development stage where they could be used for modeling.

Complete attribute data are also important. Desirable attributes may be missing altogether. As an example, the available 1:24,000 scale coverages for lakes and rivers (from NWI) and perennial and intermittent streams (from the MnDOT BaseMap) were not attributed with the names of these features. Alternatively, attributes defined by a data provider may not be the ones needed for a predictive modeling project. For example, when the Mn/DNR 1:100,000 scale statewide geomorphology coverage became available late in the project, it was discovered that the attributes were not readily suited to the derivation of variables of interest to Mn/Model.

4.5.2 Consequences of Data Quality

There are a number of possible consequences of using poor quality data. Using multiple layers of data of varying quality can compound these consequences. The end result is that the quality of the final product, in this case the models, is a function of the composite quality of all of the data layers that went into building it. Some of the specific outcomes of using the lower quality data described above are discussed in this section.

Variables derived from low resolution databases may exhibit coarse spatial patterns (Chapter 8). This is an artifact of the data, which cannot make a distinction between one 30 meter cell and another. If the data were originally rasterized by 100 meter cells, for instance, regridding to 30 meter cells would produce clusters of nine cells containing the same value. If the variable is then used to build a model, the model may display clusters of nine cells with the same probability class. Sometimes these coarse patterns are overwritten in the models by the patterns contributed by variables derived from high resolution data. However, where they are apparent, the user should be aware that they are artifacts of the low resolution of the data and that the model could probably be improved (i.e. have greater spatial resolution) by better quality data.

In the course of this study, considerable bias and error in the available archaeological database has been documented (Chapters 5, 7, and 8). The older archaeological surveys are a particular problem. Because these data were not collected using probabilistic methods, they violate the assumptions of the statistical procedures and limit statistical confidence in the models. In Phases 1 and 2, sites from older surveys and modern surveys that did not meet specific criteria were not used for building models. However, even the carefully selected surveys used for Phase 1 and 2 modeling were demonstrated to be non-probabilistic, based on the distribution of their negative survey points (Chapters 7 and 8). In addition, problems of spatial accuracy were documented during Phase 3 (Chapters 5 and 12). Inaccurate attributes, such as the misclassification of site types or ages, must be assumed to be present as well. Moreover, the number of these qualified sites was very low, particularly in some regions (Chapters 7 and 8). All of these factors have the potential to weaken the predictive power of the models. However, the effects of low site numbers were most apparent in the preliminary models. To reduce these effects, and given that the other data problems documented made these sites less distinctive from the rest, the entire database was used to build the Phase 3 models. Knowing the extent of bias in the data, models of survey bias (survey probability models) were developed in Phase 3 to supplement interpretation of the site probability models.

The implications of the inconsistency of available elevation data in Minnesota are several. As always, data quality is relative to the applications for which the data are being used. Because USGS standards for representation of elevation are met, measurements of elevation and differences in elevation from all 7.5 minute DEMs (Levels 1 and 2) should be accurate at that scale. However, the elevations measured from 1:250,000 scale DEMs are more generalized, therefore less accurate when used at a 1:24,000 scale, as they were in Mn/Model. The quality of Level 1 DEMs that exhibit distortions is poor for the purposes of generating contour maps (Figure B.1) and modeling surface hydrology (Grabrecht and Starks 1995). It is likely that slope, aspect, and measures of local relief derived from these DEMs are also distorted, though there is no way to test this without high quality DEMs from the same area for comparison. Moreover, these same measures will be more generalized when derived from the 1:250,000 DEMs, with the effect that fine scale (1:24,000) relief features are not represented (Figure B.2). It must be assumed, for Mn/Model, that the accuracy of the derived variables surface roughness, slope and prevailing orientation varies across the state.

Data with incomplete coverage may be difficult or impossible to incorporate into a model. In a grid, a distinction is made between the value NODATA and the value "0". Whereas NODATA indicates that data are missing, "0" may mean that the elevation is at sea level, that there is no elevation difference, that the slope is perfectly flat, or that a cell is on the shoreline (0 m from the edge of a lake). Grid calculations that access an input cell with the NODATA value will return an output cell with the same value, even if the same cell in other input grids have valid values. This produces variable grids with areas of NODATA, which would not be accepted by S-Plus for modeling. Grid calculations will be misleading if a "0" or other artificial value is inserted where no data are available. This produces inaccurate variable grids and misleading models. Consequently, the best solution is to not use incomplete data for modeling.

Small area models developed in Phase 2 to evaluate the contributions of incomplete data layers were not completely successful (Chapter 8). First, for some of the areas modeled, the number of available sites was too low to produce a strong model. Second, the reduced environmental variability of small areas may limit the identification of relationships between sites and variables that might be detectable over a larger area. These results bode poorly for the future development of statistical predictive models using the 1:24,000 scale landform sediment assemblage data mapped for Mn/Model (Chapter 12). At the present time, these coverages map a very small percentage of the state’s land area. They include long, narrow river valleys that may cross several distinct environmental zones, as well as several isolated upland quadrangles. To develop these detailed maps for a large enough contiguous area within an environmentally homogenous region to make statistical predictive modeling worthwhile will take years.

Missing attributes may limit the variables that can be derived or the procedures that can be implemented. Because elevation data included elevations of the water surface rather than depths below the surface, it was not possibly to model the potential for finding sites within lakes. Because lakes, rivers and streams were not attributed with their names, the features could not be automatically labeled on the maps produced for this project. Nor is it possible to perform a query to count, for example, the number of sites within 200 meters of Pelican Lake. Moreover, the missing attributes slowed down documenting environmental variations and reporting of model results because digital map products (displaying variables and models) had to be cross-referenced to paper maps to determine the names of prominent features for reference in the written report.

Even when the desired attributes can be derived from the attributes provided with a database, the process can be very time consuming. The project geomorphologists worked with the attribute data from the 1:100,000 MN DNR geomorphology coverage during Phase 3 to incorporate attributes that would be consistent with their landform sediment assemblage maps and project goals. However, the new attributes were ready only after Phase 3 modeling was complete.

4.5.3 Data Confidence Model

In Phases 2 and 3, data confidence models were developed to illustrate the geographic distribution of data quality. Since this project necessitated the use of data that were more generalized than desired, flawed, or inconsistent across the state, data quality may vary not only between models but also within the extent of a single model. Understanding the relative quality of the input data is important for evaluating how much confidence to place in a model. Consequently these data confidence models become important tools for both implementation and for prioritizing future model improvements.

In Phase 2, a very simple algorithm was used to develop the data confidence model. It consisted of the addition of three values: the number of archaeological sites used to build the model, a code for the quality of the elevation data, and a code for the scale of the lowest resolution database that figured into the best model developed for that region (Table 4.3).

Table 4.3. Values for Variables in Phase 2 Data Confidence Model.

Variable	Real Value	Code
Number of sites used to build best model	values < 1000	values < 1000

Quality of Elevation Data	1:250,000 scale	0
	Banded 1:24,000 scale	1000
	Not banded 1:24,000 scale	2000

Scale of lowest resolution database in best model	1:1,000,000	0
	1:500,000	10,000
	1:250,000	20,000
	1:24,000	30,000

The Phase 2 algorithm places the greatest weight on data scale, the second greatest weight on the quality of elevation data, and the least importance on the number of sites used for modeling.

In Phase 3, variables and weights were reconsidered by the project team. The three variables used in the Phase 2 data confidence model were retained, and an additional variable, site density per square km by subsection, was included as well. While site numbers have a strong effect on the performance of models developed, one must also consider the size of the region modeled. Low site numbers, relative to other regions, may be an indication of the region’s size. Low site densities are likely to produce more variable, hence less reliable, model results.

The number of archaeological sites used to build the model and site density per square km were both mapped by ECS subsections (Figure 3.11). The sites used to build the model excluded single artifacts, as these were not used for building the Phase 3 models. Site density was calculated using all known sites, including single artifacts. The raw values for these variables are shown in Table 4.4. The lowest resolution database scale was mapped by ECS subsections (Figure 4.7). The scales of the lowest resolution data are shown on Table 4.4. The quality of elevation data was mapped by 7.5 minute quadrangle (Figure 4.2).

Table 4.4. Raw Values of Three Data Quality Variables in Phase 3 Data Confidence Model, by ECS Subsection.

ECS Subsection	Modeled Sites	All Sites/km²	Lowest Resolution Data
Agassiz Lowlands	53	0.00323	500,000
Anoka Sand Plain	337	0.07108	500,000
Aspen Parklands	59	0.00684	500,000
Big Woods	637	0.08744	500,000
Border Lakes	960	0.11670	500,000
Chippewa Plains	513	0.06543	500,000
Coteau Moraines	220	0.03337	500,000
Glacial Lake Superior Plain	6	0.01247	500,000
Hardwood Hills	470	0.02679	500,000
Inner Coteau	130	0.03806	500,000
Laurentian Highlands	120	0.09052	500,000
Littlefork-Vermilion Uplands	25	0.00491	500,000
Mille Lacs Uplands	437	0.03066	500,000
Minnesota River Prairie	969	0.03298	500,000
Nashwauk Uplands	36	0.00835	500,000
North Shore Highlands	44	0.01294	500,000
Oak Savanna	121	0.01907	500,000
Pine Moraines & Outwash Plains	474	0.03598	500,000
Red River Prairie	270	0.01610	500,000
Rochester Plateau	81	0.01844	500,000
St. Louis Moraines	112	0.03051	500,000
Tamarack Lowlands	74	0.01047	500,000
The Blufflands	554	0.12052	500,000
Twin Cities Highlands (St. Croix Moraines)	126	0.05599	250,000

To build the data confidence model, the values of the four variables were first normalized to a scale from 0-100. Site numbers and site density were normalized by the following equation:

Normalized Value = (Value – Minimum Value)/0.01(Maximum Value– Minimum Value)

The quality of elevation data and the lowest resolution database scale were normalized according to the values in Table 4.5.

Table 4.5. Normalized Values for Elevation Quality and Data Resolution Variables in Phase 3 Data Confidence Model.

Variable	Real Value	Normalized Value
Quality of Elevation Data	1:250,000 scale	0
	Banded 1:24,000 scale	50
	Not banded 1:24,000 scale	100

Lowest resolution database scale for variables in best model	1:1,000,000	0
	1:500,000	33
	1:250,000	66
	1:24,000	100

The normalized values for each variable were assigned weights based on their assumed importance in determining model outcomes (Table 4.6). Numbers of sites and site densities were assigned the highest weights because of their demonstrated effects on model performance and variability (Chapter 7). Elevation quality and resolution have noticeable effects on the spatial patterns of the models and probably also on the variables selected. However, without higher quality data for the same region for comparison, it is impossible to evaluate their effect on model performance. Consequently they were assigned slightly lower weights.

The algorithm for determining data confidence values was:

Data Confidence = å _1-n(Normalized variable value * Variable weight)

The mapping units of the four variables figure largely in the geographic pattern of data quality (Figure 4.1). The effects of elevation data quality in enhancing or detracting from the overall data confidence in a region create patterns with straight line borders. The smallest of these represent individual 7.5 minute quadrangles. The largest represent groups of quadrangles of the same data quality. The effects of site numbers and site densities are apparent by the large patches lacking straight line borders. The resolution of the lowest quality data set used had little influence on the model, as 1:500,000 was the variable value for all but one of the modeled regions. The resultant model shows the highest data confidence in the Border Lakes subsection along the Canadian border, most of the Blufflands subsection along the southeastern border with Wisconsin, and part of the Big Woods subsection in south central Minnesota. All three of these regions have both high site numbers, high site densities, and unbanded 1:24,000 elevation data. A portion of the Minnesota River Prairie subsection in southwestern Minnesota also has comparably high data confidence values. This area has very high site numbers, but low site densities.

Table 4.6. Weights Assigned to Variables in Phase 3 Data Confidence Model.

Weight	Variable
0.75	Number of modeled sites in ECS subsection (normalized)
0.50	Quality of Elevation Data
0.50	Lowest resolution database in model
0.75	Site densities (per sq. km) by subsection

Return to Top

4.6 REGIONALIZATION

In the context of this project, regionalization refers to the practice of dividing the state into a number of smaller, relatively homogeneous areas for the purposes of data development and modeling. This was desirable for several reasons.

First and foremost, a state the size of Minnesota has a wide range of environmental variation. Factors associated with precontact site selection are expected to differ between forests and prairies or glaciated and unglaciated landscapes. It is assumed that some of these environmental regions in general are more attractive to, or were more inhabitable for, hunter-gatherers than others (Chapter 3). Models developed statewide would likely focus on variables with greater variation statewide than locally, losing information about locally important variables.
Furthermore, in Minnesota, known archaeological sites are not distributed evenly throughout the state (Figure 4.3). In many cases, concentrations of known sites correspond to concentrations of surveys (Figure 4.4). At the same time, large areas of the state, particularly in the north and northeast, have not been surveyed. This would undoubtedly bias the high probability areas towards regions with more surveys and more known sites.
Finally, models developed without regionalization would be likely to perform poorly, as the generalized site/environment relationships discerned would not adequately represent the specific situation in any part of the state. Either many sites would not be predicted by the model, or a very large land area would be classified as high/medium probability to predict 85 percent of known sites.

Previous research has involved modeling areas much smaller than the state of Minnesota. Table 4.7 reports the study area sizes for 11 predictive modeling projects that preceded Mn/Model. The largest of these projects (Carmichael 1990) modeled approximately 11 percent of the area covered by Mn/Model. None of these researchers considered regionalization of their models. Consequently, there was not an available precedent for determining the best scheme for dividing Minnesota into regions for the purpose of developing Mn/Model.

Table 4.7. Comparison of Mn/Model Study Area Size with that of Selected Previous Predictive Models.

Study Area Location	Source	Study Area Size (km²)
Minnesota (statewide)	Mn/Model	207,256
North-Central Montana	Carmichael 1990	22,021
Black Sturgeon Lake Area, Ontario	Della Bona 1994	1,973
Regional Municipality of Waterloo, Ontario	Young et al. 1995	1,390
Brightsand, Ontario	Della Bona 1994	1,204
Fort Hood, Texas	Williams, Limp, and Briuer 1990	878
Raccoon River Greenbelt, Iowa	Anderson 1995	539
Piñon Canyon, Colorado	Kvamme 1992	460
Abitibi Block 6, Ontario	Della Bona 1994	392
Eastern Prairie Peninsula, Illinois	Warren and Asch 1996	322
Sparta Mine Area, Arkansas	Lafferty et al. 1981	142
Central and Lower Passaic River Basin, New Jersey	Wise 1981	93
Western Shawnee National Forest, Illinois	Warren 1990a	91
Passaic River Basin, New Jersey	Hasenstab 1991	20 (scattered)

4.6.1 Phase 1 Regionalization

Initially, consideration was given to the development of regions based on like sets of environmental variables. However, to derive such a classification scheme would be a major research project in itself. For practical reasons, previously defined regions suitable for the model were sought. MnDOT Transportation District Boundaries were considered, but rejected because they were not based on either environmental or archaeological criteria. Attention turned to the nine archaeological resource regions (see Figure 3.10) and their subregions defined for the state of Minnesota by Anfinson (1990). The characteristics of these regions, their derivation, and reasons for selecting them for this project are described in Section 3.4. These nine regions were used for the initial regionalization of the models in Phases 1 and 2.

The regions are:

1.        Border Lakes
2.        Central Lakes Coniferous
3.        Central Lakes Deciduous
4.        Lakes Superior
5.        Northern Bog
6.        Prairie Lakes
7.        Red River Valley
8.        Southeast Riverine
9.        Southwest Riverine

In Phase 1, five regions and, within them, only the counties with probabilistic surveys were modeled (Figure 4.5). The regions modeled were Central Lakes Coniferous, Central Lakes Deciduous, Prairie Lakes, Southwest Riverine, and Southeast Riverine. The remaining regions (Red River Valley, Northern Bog, Lake Superior, and Border Lakes) were not modeled in this phase because they did not contain counties with probabilistic surveys.

Each Phase 1 region contained between 41 and 190 archaeological sites from probabilistic surveys that were used to create models. However, it was observed that models based on small samples, covering small areas (41 in Southwest Riverine, 31 in Nicollet County) performed as well as or better than models that used larger samples in larger areas. This implies that increasing the size of the region has a stronger negative effect on model results than reducing the sample size and area. Presumably, this is because smaller areas are more environmentally homogeneous.

4.6.2 Phase 2 Regionalization

When all 87 counties were modeled in Phase 2, some large regions were divided into their subregions (Figure 4.6) to increase the amount of environmental homogeneity within each area modeled and to reduce the digital data to more manageable file sizes. All 21 subregions defined by Anfinson (1990) would have been used for modeling had site numbers not been a problem. Consequently, modeling in Phase 2 was organized by the groupings of subregions indicated in the second column of Table 4.8. Where enough sites were present, subregions were modeled separately. Where too few sites were available for modeling, adjacent subregions were combined. In all, 15 models were developed.

Table 4.8. Archaeological Subregions and Subregion Groups Modeled in Phase 2.

Archaeological Resource Subregion	Corresponding Phase 2 Modeled Region
Border Lakes	Border Lakes
Central Lakes Coniferous Central Central Lakes Coniferous North Central Lakes Coniferous West	Central Lakes Coniferous Central, North and West
Central Lakes Coniferous East Central Lakes Coniferous South	Central Lakes Coniferous South and East
Central Lakes Deciduous East	Central Lakes Deciduous East
Central Lakes Deciduous South	Central Lakes Deciduous South
Central Lakes Deciduous West	Central Lakes Deciduous West
Lake Superior North Lake Superior South	Lake Superior
Northern Bog East Northern Bog West	Northern Bog
Prairie Lake East	Prairie Lake East
Prairie Lake North	Prairie Lake North
Prairie Lake South	Prairie Lake South
Red River Valley North Red River Valley South	Red River Valley
Southeast Riverine East	Southeast Riverine East
Southeast Riverine West	Southeast Riverine West
Southwest Riverine	Southwest Riverine

Two problems were noted in the Phase 2 models that were thought to be possibly attributable to the regionalization scheme used. One of these was an edge effect, where linear boundaries of the Southeast Riverine and Northern Bog regions were strongly reflected in the distribution of model probability classes (Figure 8.4). Had these regional boundaries been based on natural features, such as rivers or ridges, the anomalies may have been less noticeable. However, it was immediately apparent that these straight-line patterns of high probability cells were not related to any natural features that would have influenced the locations of archaeological sites. Nor were they artifacts of the environmental data, such as boundaries between elevation data of different scales.

The second problem related to the regionalization scheme was that many of the archaeological resource regions and subregions contain a relatively wide range of environmental variability. The best models in Phase 2, as in Phase 1, came from regions with little environmental variability. The issue was not necessarily the size of the region modeled, but whether it overlapped into two or more geomorphic regions or vegetation zones when compared to the environmental data.

4.6.3 Phase 3 Regionalization

Towards the end of Phase 2, the Ecological Classification System (ECS) sections and subsections (Figure 4.7) became available in GIS format for Minnesota (Hanson and Hargrave, 1996). The derivation of this classification system is detailed in Section 3.5. Since it was developed to identify, characterize, and delineate units of land with similar climatic, geological, physical, and biological features, it was hypothesized that its use as a regionalization scheme for Mn/Model Phase 3 would provide regions with greater environmental homogeneity than was previously available. Moreover, since ECS boundaries are based entirely on natural features, it was hoped that the edge effect observed in Phase 2 could be reduced or eliminated by its use.

This regionalization scheme had several other advantages for Mn/Model.

It is an integrated multi-factor classification scheme, based on climate, geology, geomorphology, and historic vegetation. As such, it is conceptually compatible with the multivariate approach used for modeling.
It was developed by a team of scientists using a comprehensive, documented interdisciplinary approach. Consequently, it is more likely to be replicable than is the archaeological resource region scheme that was developed by an individual.
It is conducive to modeling at multiple scales. In the future, modeling can be extended to lower levels of the hierarchy (smaller than Subsections).
Because it is being developed nationwide, its use provides the potential for consistency across states and between agencies.
Regional and subregional boundaries are based on natural features, primarily geomorphic features that have been in the same place since the end of the glacial period. The archaeological resource regions and subregions are largely delineated by modern political boundaries or relatively arbitrary straight lines.
The ECS boundaries were defined at a higher spatial resolution than were the archaeological resource region boundaries.
The ECS subsections were much more homogeneous with respect to our independent variables than were the archaeological regions, thus reducing "noise" in the analysis.
Finally, it was designed to facilitate the understanding of relationships between single environmental components within homogeneous environmental regions. This is consistent with Mn/Model’s modeling approach.

For these reasons, ECS sections and subsections were adopted as the regionalization scheme for the Phase 3 models. This decision has been criticized locally by proponents of the Archaeological Regions used in Phases 1 and 2. Their argument is that there is archaeological meaning to that classification scheme, whereas archaeological data were not considered in development of the ECS. However, Mn/Model is by necessity an environmentally deterministic model. Too little is known about Minnesota's prehistoric cultures to incorporate cultural data into the modeling scheme. Since the only alternative is to predict site locations based on the characteristic environments in which human activity was concentrated, it is important that the regions modeled be environmentally homogeneous to the greatest extent possible. Heterogeneous environments otherwise create uninterpretable noise in the models and may have attributed to the edge effects observed in Phase 2.

4.6.3.1 Concerns Regarding the Application of ECS Regions to Mn/Model

Concern was expressed that the ECS regions were developed for natural resource management, not archaeological applications, and were therefore more representative of recent environments than might be appropriate for this project. However, the archaeological resource regions are also based on recent environments, emphasizing the characteristics of the Late Holocene and historic periods (Section 3.4). The Mn/Model team felt the ECS regions were most appropriate for modeling because of the number of environmental factors considered, the multi-disciplinary methods used for their delineation, the correspondence with significant geomorphic discontinuities, the resolution and quality of their source data, and their environmental homogeneity.

The greatest flaw with the use of modern data for regionalization is that climate zones and biotic provinces have shifted dramatically across Minnesota in the last 10,000 years (Figure 3.1). Geomorphic and topographic features change relatively slowly. Although erosion and deposition can transform local landscapes very quickly, the landforms created by Pleistocene glacial advances and retreats still dominate the Minnesota landscape (Figure 3.9). However, over the last 10,000 years these landforms have existed in constantly changing climates and biotic provinces (Figures 3.1, 3.4, 3.5, 3.6, and 3.8).

However, the contribution of time-transcendent topographic and geomorphic features to the local microclimate and patterns of vegetation distribution should not be underestimated. Areas historically protected from fire by virtue of their relief or their situation with respect to water bodies would likely have had the same kind of protection in the more distant past (Grimm, 1981). While these areas may have supported stands of Big Woods in the contact period, they likely supported oaks or other more drought resistant trees or shrubs in the drier prairie period. Consequently, they may have held the same kind of attractions for human settlement, fire protection and a source of fuel, throughout time. This does not, however, negate the importance of climate change's effects on settlement pattern on a larger scale, in the movement of people to different regions in search of wildlife or other resources and the very important local effects on lake levels and shifting stream courses. In addition, drier periods may have reduced site density and increased the probability of artifacts being buried by wind or water borne sediment.

To the limited extent that ECS regions rely on historic vegetation boundaries for their definition, they suffer from being time-static. However, the vague definitions of the archaeological resource region boundaries do not necessarily impart time-transcendence. Ideally, there should be different regionalization schemes for several time periods or cultural complexes in Minnesota's past. The potential for creating such regionalization schemes is discussed in Section 4.6.4.

4.6.3.2 Mitigating Factors

However, even the modern distribution of ecosystems within the landscape can provide useful information about past environmental patterns. This is because of the intimate nature of the relationship between vegetation and soils. The formation of distinctive soil types has traditionally been considered a function of climate, organisms (especially vegetation), parent material, topography, and time (Brady 1974:303; Eyre 1968: 25-44; Jenny 1980: 9-14, 160-197). Jenny (1980: 201-203) takes a different perspective, viewing soils and organisms as the ecosystem, which evolves together. He considers climate, topography, parent material, and time to be "state factors," which may act as controls on the evolving ecosystem, but are independent of it. These state factors may or may not be independent of each other. Yet his examination of the evolution of several ecosystems emphasizes both the great lengths of time that may be involved and the importance of variations in parent material (Jenny 1980: 207-245). Moreover, he determines that soil properties evolve slowly compared to the potentially rapid rates of vegetation change (Jenny 1980: 243). This relatively slow rate of soil development can be attributed to the time required for processes to change the nature of parent material and relief.

Where erosion and deposition are not prevalent, parent material and topography can be constant over long periods of time. These play a key role in determining microclimate and the relative moisture content of soils within a local landscape (Bannister 1976: 18-20; Jenny 1980: 39-41, 246-304; Scott 1974: 51-55). Although vegetation is an important contributor to microclimate and moisture relations, parent material or topography may limit the establishment of some kinds of vegetation, which are otherwise climatically adapted, in parts of the landscape (Cooper 1961; Etherington 1975: 154-216; Mooney 1974; D. Scott 1974).

Moreover, topography may control microclimate in ways that produce profound constraints on vegetation and soils. Propensity for flooding, for instance, will preclude the growth of vegetation that cannot tolerate saturated soils. In Minnesota, Grimm (1981, 1984) concluded that fire was the primary factor controlling the distribution of mesic deciduous forest, oak savanna, and prairie within the Big Woods region. Water bodies and topography strongly controlled fire patterns. He observed that "sites with identical physical characteristics but slightly different geographical locations, such as on the opposite sites of a river, can support very different types of persistent vegetation" (Grimm 1981: 176).

It is likely that such strongly controlling factors are to a certain extent time-transgressive. The Big Woods region was dominated by prairie during the mid-Holocene, was invaded by oak woodland beginning about 5,000 B.P., and was rapidly invaded by mesic hardwoods only about 300 years ago (Grimm 1983). These changes were most likely induced by increased precipitation and possibly decreased temperatures. Undoubtedly, however, fire has played a role in the region throughout at least the last 5,000 years. While oak savanna and prairie now persist only in the most fire-prone parts of the landscape, it is likely that the most protected places served as refuges for oaks and mesic tree species when the climate regime was drier and prairies were dominant. Historic vegetation patterns can therefore hold the key to past patterns and may prove useful to reconstructions of paleolandscapes.

4.6.4 Potential for Regionalization Based on Past Environments

Although an archaeological predictive model could clearly be improved by the use of a regionalization scheme based on information about earlier environments, there are obstacles to such an implementation.

To support high resolution modeling statewide, the regionalization scheme must define at least 15-20 distinct regions. The development of such a scheme would require the availability of information about earlier environments at a scale of 1:500,000 or better. Paleoclimate and palynological data have not been developed at that scale in Minnesota. An evaluation of available palynological and paleoclimate data can be found in Section 6.3. Neither was deemed to provide the resolution needed to make them useful for Mn/Model.
There is no single past environment. Incorporating specific paleoclimatic information would require definitions of not only regions but also of time periods. Such an approach would necessitate modeling only sites of a specified age or cultural affinity. This greatly reduces the number of sites available for modeling. Moreover, since this information for many sites in Minnesota is not known, site numbers would be reduced even further. Given the very low numbers of Archaic and Paleo-Indian sites in Minnesota, it would not be possible to model these statistically.

Most known archaeological sites in Minnesota date from the last few hundred years. Models based on ECS subsections should be appropriate for the majority of these sites. They are expected, however, to do a poor job predicting the older, but less abundant sites in the state. However, even if higher resolution paleoclimate and paleoecological data were available for modeling these sites, the low site numbers would prevent the development of these models using statistical methods. Given the constraints of site numbers, only one time-generalized model is now possible for Minnesota.

Return to Top

4.7 PROCEDURAL DESIGN

Not only is it important to design the physical aspects of the GIS system, it is critical to design the procedures that will be used for data development, maintenance, storage, security, analysis, and presentation. In a large project such as Mn/Model, design and documentation of all procedures provides for consistent results, products that will be suitable for subsequent tasks, and continuity in the face of staff changes. Subsequent automation of procedures, to the extent possible, provides further insurance of consistency. This does not imply, however, that these procedures should be written in stone. Provisions for correcting and updating procedures on the basis of experience, lessons learned, and changing project requirements must be part of the design.

For large GIS projects, it is customary to start with a pilot. This involves collecting the necessary project data for a very small area, developing procedures for the entire range of project tasks, and testing procedures on the pilot area. This has a number of advantages:

It allows the evaluation of a wide range of data for the project. Data sets that are determined to be unsuitable can be discarded with little loss in investment.
Since small geographic areas process more quickly, a pilot is an opportunity to evaluate a number of techniques in a relatively short time.
A mistake made on a pilot county is a learning experience. A mistake made on the entire state is a considerable amount of time lost.

Careful selection of the pilot area is important. For Mn/Model, the pilot area selected was Nicollet County. It had several advantages:

Databases that were not yet available statewide at the beginning of the project were available for Nicollet County. These included National Wetlands Inventory, 7.5 minute DEMs, and high-resolution soils.
There were a relatively large number of known archaeological sites in the county, including sites from a probabilistic survey, which would allow development of prototype models.
The county included a portion of the Minnesota River Valley, the first valley mapped by the project geomorphologists.
It is a fairly small county, compared to others in Minnesota. A small pilot area size reduces data storage and processing requirements.

There are, however, some lessons that can’t be learned in the pilot phase.

Experience on a pilot may not be a good basis for estimating the time required for modeling the entire state. The time required for some GRID functions increases linearly with the number of cells processed; for other functions, the time increases exponentially.
Not all potential pitfalls will be discovered in the pilot. Expect more and bigger pitfalls when the project is underway.
The pilot area may not include all of the data problems that will be encountered for the state as a whole. Procedures that worked smoothly in the pilot may have fatal flaws when applied to other geographic areas.
Pilots may not provide the information needed to refine modeling techniques for large areas. Models derived for relatively small areas with large numbers of sites tend to perform well. When the same procedures are used for larger, more diverse areas, particularly those with low site numbers, unanticipated problems arise.

4.7.1 GIS Data Development

GIS data development includes both data conversion and data integration. Data conversion refers to the transformation of data from paper to digital format or from one digital format to another. Integration refers to processing necessary to allow different GIS data sets to work together. A distinction is made between vertical integration, which registers GIS layers for accurate overlay, and horizontal integration, which involves combining adjacent GIS datasets of the same theme so that the resulting digital map is topologically consistent (features are correctly connected at the adjacent map edges).

Data conversion and integration were the most time-consuming tasks for Mn/Model, as is common in most GIS projects, even though few datasets had to be digitized from paper maps. Conversion of data from EPPL7 GIS format to the ARC/INFO GRID format required for Mn/Model was involved because there was no direct translator available. For the majority of the data, which arrived in digital form, considerable processing was necessary to integrate it properly. Several processing tasks were commonly required for each layer (see Appendix B and the Mn/Model metadata). In addition, the processing of many large databases required careful management of file sizes and computer resources and formal quality control procedures.

4.7.1.1 Data Conversion

Data conversion is the most basic GIS procedure. Developing the models using ARC/INFO GRID software required that all data be in ARC/INFO coverage or grid format. The types of data conversion required for this project are summarized below.

Paper to Digital Conversion

Some geographic data were not available in digital format. Several layers that were considered potentially important were digitized from paper maps. These included Trygg maps, bedrock geology for some southeastern Minnesota counties, and the Mn/Model geomorphology project field maps. None of these layers was completed for the entire state because of time constraints. These layers could be used for modeling only in limited areas. Fortunately, none were considered critical for model development. Because digitizing takes many hours, it is not realistic to begin a statewide modeling project until statewide coverage of all critical layers is available digitally.

For Mn/Model, all digitizing was done using AutoCAD. The AutoCAD drawings were converted to PC ARC/INFO coverages (with a y-shift to preserve precision), then to ARC Export files, using ArcCAD software. From the ARC Export files, ARC/INFO coverages, then grids, were created. Digitizing procedures are described in more detail in Appendix B.

Generating Digital Maps From Coordinates

Point coverages can be generated from x and y coordinates contained in a database file. Where locations of data observations have been recorded as latitude/longitude or UTM coordinates read from topographic maps, this is useful procedure for creating a spatial database. The original database file can be joined to the resulting point coverage attribute table.

Coverages of the locations of archaeological sites and random points were generated from x/y-coordinates (UTM) recorded in digital database files received from SHPO (see Section 5.5 and Appendix B). Most of these coordinates were in NAD27. The coverages created were projected to NAD83, then converted to grids. Projection of the coverage does not change the values of the UTM coordinates in the attribute tables, which retain the original NAD27 coordinates.

Climate and pollen data were received as ASCII files with coordinates of weather stations and pollen cores in latitude/longitude decimal degrees. Point coverages were generated from these files, then projected to UTM NAD83. Climate and pollen surfaces were interpolated from the coverages in GRID (see Section 6.3 and Appendix B).

Digital to Digital Conversion

Most of the digital data received were in ARC/INFO coverages (vector format) and had to be converted to ARC/INFO grids (raster format) for variable derivation and modeling. However, the MGC100 database and many county soils databases came in EPPL7 raster format. These had to be converted to ERDAS images using EPPL7 software from LMIC (see Appendix B). The ERDAS images could then be converted to ARC/INFO GRID format.

4.7.1.2 Horizontal Data Integration

For this project, horizontal integration consisted of taking the data in whatever tiles it was provided and manipulating it so that it was tiled by the regions used for variable derivation and modeling. In Phase 2, variables were derived for the same 21 regions that were used for modeling. In Phase 3, in the interest of saving time, ECS subsections were combined into nine very large regions for the purpose of deriving variables. For Phase 3 modeling, the variable grids for each ECS subsection were extracted from these nine regions as needed.

Throughout this project, counties were used as the most basic tiling unit for two reasons. First, counties are smaller than potential modeling regions. They can be assembled into a variety of combinations to create regions. This allows flexibility in the definition of regions and allowed Mn/Model to easily move from archaeological resource regions in Phases 1 and 2 to ECS subsections in Phase 3. Second, MnDOT specified that final delivery of all data and models be tiled by counties to facilitate distribution via the MnDOT GIS server. Tiling the data by counties in the beginning saved work later in the project when deliverables were created.

Some data received were tiled by units smaller than counties. This required appending the data to form the county coverages or grids. National Wetlands Inventory and USGS DEMs were both distributed as one digital file per USGS 7.5 minute quadrangle. There are 1,745 quads in Minnesota. These were consolidated to form 87 county files. Likewise, soils data received as township files were joined to make counties.

Other digital data layers were received as statewide maps, in either raster or vector format. These included all of the MGC100 layers, the Marschner map, the DNR geomorphology coverage, tree species distributions, and bearing trees. These layers were split into regions for variable derivation and into counties for final distribution. Performing two splits (regional and county) is much more efficient than splitting by counties, then appending counties into regions.

4.7.1.3 Vertical Data Integration

All maps in the GIS must be carefully standardized to overlay properly. Most commonly, this involves map projection (see Section 4.3.1 ). Digital maps are distributed in a variety of projections, units, and datum. To make these data compatible with the standards established for the project sometimes required projecting from other map projects or coordinate systems to UTM. Projection was also needed to convert data already in UTM coordinates from NAD27 to NAD83 or to convert units from feet to meters.

Vertical integration of grids involves additional considerations. First, all grids must have exactly the same cell size to overlay properly. For Mn/Model, this cell size was 30 meters. Second, all grids must have the same starting point. In other words, the lower left-hand corners of the grids must have exactly the same geographic coordinate. If this is ignored, grids created by analysis of two or more misaligned grids can contain gaps containing no data. To insure that all grids register correctly, it is advisable to create a single statewide grid at the onset of the project and to match all subsequent grids to that grid. That this grid be statewide is critical. If separate regional grids are used for matching, there is no guarantee that they will align exactly when merged. For Mn/Model, the statewide elevation grid was used for this purpose. All subsequent grids were matched to that grid.

4.7.1.4 Systems Considerations

In determining system requirements, storage and processing demands must be considered. Both will depend on the size of the databases. Database size will vary with the geographic extent of the data, the number of layers, and the data format. With raster data, grid size will also depend on the cell size, the number of values in the grid, and the way the grid values are stored.

Grid values can be stored as either integers or floating point numbers. Floating point grids can be very large. The floating point elevation grid for Nicollet County, a relatively small county without much relief, was 15 MB. Integer grids for the same county are generally between 3-6 MB. Grids derived by distance functions (i.e., distance to lakes) are even larger. In Nicollet County, one was 400 MB when stored as floating point, 90 MB as integer. Early in the project it was determined that it would be necessary to convert all grids from floating point to integer, both to save disk space and to speed processing. Also, some GRID functions (such as EUCDIST – Euclidean distance - and COMBINE) work only with integer grids.

For data archiving and backup, multiple redundant systems were maintained. First, the entire system was regularly backed up to tape. Second, whenever data were received on tape or CD-ROM, the original media were stored in a locked drawer after downloading the data to the hard drives. Third, when a significant processing step was completed (for instance, splitting a statewide grid into county grids), the results were archived to a tape. Finally, county data, derived variables, models, graphics, and other important files were written to CD-ROM for use at MnDOT and for the project archives.

4.7.1.5 Quality Control

A variety of quality control procedures were established for this project to ensure quality and consistency in the conversion of data, the analysis of data, and the production of products. Quality control was maintained by training, supervision, and documentation of procedures, macros, and metadata. It was also maintained by careful evaluation of apparent problems in incoming data. For example, USGS 7.5 minute DEMs that exhibited banding were carefully examined to determine how much the error would affect the analysis.

Throughout the project, written work orders were issued for all technical procedures. The procedures were tested, errors in the work orders were noted, procedures were refined, and work orders were updated to reflect these changes. Formal work orders, with step-by-step procedures, insured that consistent results were achieved even though several staff might work on the same procedure. These work orders also became the first level of documentation for the project. At each phase of the project, work orders from the previous phase were reviewed, refined, updated, and reissued, if appropriate.

The GIS Standards and Procedures document (Appendix B) was developed from the work orders and used to serve as a record of what was accomplished and a guide to proceeding consistently and deliberately toward completion. It included documenting procedures for processing and converting GIS data, for defining model variables, and for data analysis.

At the end of Phase 1 of the project and the completion of the initial model of the Phase 1 counties and regions, a major quality control effort was completed. Quality control procedures were refined and formalized for both general and specific procedures. The data, variables, and models for the Phase 1 counties were thoroughly checked for errors. Any procedures found to be flawed were corrected, and any subsequent errors were repaired. Quality control measures were incorporated directly into procedures. Instructions and some macros for the revised procedures were issued and used for Phase 2 of the project.

Additional quality control was performed at the end of Phase 2. For Phase 3, Phase 2 procedures for deriving variables and modeling were further refined and automated to a greater extent. Automation insures that procedures are carried out in exactly the same way on each data set. However, it does not insure that the results for each data set are as expected. Data set errors or irregularities can produce unexpected results. Quality control, including visual checking of grids produced by automated results, is still essential.

Despite quality control efforts, distributed data may retain some errors. Many problems were in data received, and correcting them was beyond the scope and budget for the project. For instance, USGS 7.5 minute DEMs for adjacent map sheets do not always meet at the same elevation. When joined, there will be an apparent bluff along the quad edge. This error will then propagate through any layers derived from the elevation data and may even be apparent on the models.

Finally, metadata are an important component of quality control (see Section 4.3.4). Digital data are not useful without metadata to explain their source and content. Metadata also document any known problems. As not all errors can be corrected, they should be documented in the metadata so data are not used inappropriately. Metadata developed for this project conform to the Minnesota Geographic Metadata Guidelines (Minnesota Governor’s Council on Geographic Information GIS Standards Committee 1996), which are based on the FGDC Content Standards for Geospatial Metadata.

4.7.2 Operationalizing Variables

Model variables were derived from the data layers. Variables describe that component of the data used for locating archaeological sites. Many different variables can be derived from a single environmental layer. Most of the environmental variables derived for this project were measurements of distances to key resources. Other variables measured the diversity of resources in an area. Still others were derived from complex measurements (i.e., deriving slope and aspect from elevation) of landscape characteristics.

Variable definition is an important aspect of model building. Variables must meet certain criteria to be included in the model:

They must be measurable from the available data. For example, the variables horizontal and vertical distance to water can be derived from the position of known water bodies and elevation data.
They must have some potential significance to human settlement or activity patterns.
It must be possible to derive them from an algebraic formula or series of algebraic formulas.

Many categorical data (presence or absence of hydric soils, vegetation type) were not used directly as variables, but were represented by continuous variables derived from them (distance to hydric soils, distance to woodlands). These features were considered resources that were important for their proximity to an archaeological site rather than for being at the site. On the other hand, a few categorical variables were based on the presence or absence of a feature. An example would be on alluvium, as it was considered more important that a site actually be on or off of alluvium, rather than simply near alluvium. Bluff top sites, for instance, would be near alluvium but would not have the same characteristics as sites on the flood plain.

Certain environmental layers evaluated were not useful because their interpretation was not apparent. Some data from MGC100, for example, were classified into nonexclusive categories, such as depth to bedrock categories of 0 - 2+ feet; 0 - 20+ feet; 1+ feet; 1-4 feet; 2 - 4+ feet; etc.

Some derived variables were rejected after testing because their scale was deemed inappropriate to the resolution of the data or the scale of patterning of the features. Vegetation diversity variables measured within 30 and 90 meters were rejected, but vegetation diversity variables measured within 510 and 990 meters were kept.

Approximately 40 basic variables were used in developing the Phase 1 models for each region. These included variables derived from:

Elevation

National Wetland Inventory (lakes and wetlands)

Streams

Coarse resolution soils

Archaeological database

Tree species distribution

Enhancements to the Phase 2 models were based on statewide layers that became available later. These included:

Marschner map

Coarse resolution quaternary geology

Coarse resolution water erosion

Coarse resolution sedimentation

Tree species distribution

Additional Phase 2 enhancements to some regions or counties were based on data that are not available statewide:

High resolution soils (available for 27 counties) (Figure 6.1)
River valley landforms (8 river valleys and one basin) (Figure 6.23)
Trygg maps (for 20 counties) (Figure 6.2)
Bedrock geology (for counties in the Southeast Riverine Region)
Coarse resolution wind erosion

Phase 2 modeling utilized 68 variables that were available statewide and up to additional 17 variables for the enhanced models. The performance of these variables was reviewed at the end of Phase 2. Variables that appeared in very few models or that appeared to be redundant with other variables were eliminated. Only 37 of the 69 statewide variables were carried over into Phase 3. To these were added six variables derived from bearing trees and watershed data layers, for a total of 43 statewide variables. One additional variable, distance to bedrock used for tools, was carried over from Phase 2 for modeling in southeastern Minnesota only. The archaeological and environmental variables are listed, defined and discussed in Chapter 6. Their derivation is detailed in Appendix B.

4.7.2.1 Buffering Regions

Many variables derived were measurements taken from cells to the nearest feature of a certain type (lake, stream, etc.), yet the nearest feature to some cells was outside the region in which the cell was located. This could have resulted in incorrect variable values around the edges of regions. This error would multiply as the number of regions studied increased and their areas decreased. To mitigate this problem, in Phase 1 and 2 1000-meter (1 km) buffers were generated around regions. All environmental data extended to the outer edge of this 1 km buffer. Thus, when the variable "distance to nearest lake edge" was calculated for a region, cells close to the regional boundary might measure distances to a lake in the adjacent region, but within 1 km of the region boundary. There was still a potential for error. The nearest lake may be 1.5 km from the region boundary. However, it was assumed that features more than 1 km distant would probably not have had a strong influence on site location. In Phase 3, the potential for this kind of error was further reduced by using a 10 km buffer and reducing the number of regions to nine.

4.7.3 Modeling Procedures

Modeling is a multi-step procedure (Kvamme 1988, 1990; Warren 1990b). It involves data preparation and analysis, then model application, classification, evaluation, and refinement. These components of modeling are touched on briefly in this section. Modeling procedures used for Mn/Model are discussed in detail in Chapter 7.

4.7.3.1 Database Preparation

Statistical modeling requires first building a database for analysis. This database must contain information about the cells being modeled, including their archaeological status and values of environmental variables. Archaeological status for Mn/Model consisted of codes for sites of several types, negative survey points, and random points. For logistic regression analysis, which is based on presence/absence, these codes were further simplified in the database to site present or site absent. The site present determination was made based on which types of sites were being modeled. The site absent determination was made based on whether negative survey points or random points were being used for comparison.

The database must be consistent with the requirements of the statistical software and the analytical procedures selected. For logistic regression in S-Plus, this meant that it had to be in ASCII format, that there could be no null (NODATA) values contained in the data, and that all values had to be expressed numerically.

4.7.3.2 Statistical Analysis

Statistical analysis can be performed in the GIS software itself (if the functions desired are available) or in a separate statistical software package. This project utilized both ARC/INFO GRID and S-Plus for statistical analysis. However, GRID was found to be quite limited and ultimately S-Plus was used exclusively to perform the Phase 3 analysis.

The statistical software can perform a number of functions. For Mn/Model, these included:

Transformations of derived variable values to square roots, sines, and cosines.
Identification of pairs of variables that were 100 percent correlated or individual variables that contained no variance within a region.
Selection of the best combinations of variables for models.
Determination of the appropriate weights to assign to variables in a model. In Mn/Model, these were the logistic regression coefficients.
Performing tests of significance on variable and model values.
Producing histograms or other graphics illustrating data trends.

Selection of the best variables and determination of their weights are the key steps to identifying the models. For Mn/Model, this was accomplished in S-Plus using logistic regression. This procedure analyzes the database and returns several models, consisting of lists of variables and their coefficients. It also provides information useful for interpreting the relative merits of the models and of the individual variables.

4.7.3.3 Applying and Classifying Models

A statistical model is simply a group of variables and their associated weights that can be entered into an equation to determine a probability value for the presence of archaeological sites. To translate this mathematical model to a map of site potential, the equation must be applied to every cell in the landscape being studied. This is readily done in raster GIS. The result is a probability surface in the form of a grid wherein each cell contains a raw model value indicating its potential for containing archaeological resources. Application of a logistic regression equation provides a probability surface with a continuous range of values between 0 (no potential) and 1 (highest potential).

A probability surface may provide a good visual representation of site potential, but it is difficult to evaluate and interpret. Reclassifying the potentially infinite number of raw model values to a smaller number of discrete categories simplifies matters. For Mn/Model, a number of decision rules were tested for classifying model values into high, medium, and low probability categories (see Chapter 7). These rules define cutoff points in the raw model values that will be used to distinguish between low, medium, and high probability categories. The decision rules are applied to the probability surface to create a grid of high, medium, and low site potential. This grid is then evaluated, interpreted, and presented as the model.

4.7.3.4 Evaluating Models

The goal of model evaluation is to assess the performance of models to determine if they are achieving project goals and to provide a basis for selecting the best models from among a number of alternatives. Throughout Mn/Model, the first evaluation performed on each model developed was to determine whether known archaeological sites were more likely to be in areas of high probability, as predicted by the model, than in areas of low probability. For example, assume that the assignment of cells to the high, medium, and low probability was made randomly. In this case, approximately 1/3 of the land area and 1/3 of the known sites should fall into each probability class. However, the assignment of values by a model should not be random. Assignment of 1/3 of the land area to each of the three probability classes on the basis of model values should provide a different, nonrandom pattern. If the model works well, more than 1/3 of the sites should fall in the high probability area, less than 1/3 in the medium probability area, and much less than 1/3 in the low probability area. If this is not the case, then the model is performing poorly.

This relationship between the area in each probability class and the number of sites that fall within that area can be summarized using Kvamme’s gain statistic. Gain is calculated as:

1 - (percent area in high and medium probability/percent sites in high and medium probability)

The resultant values can be used quantify model performance and to compare the results of different models (see Chapter 7).

Evaluation of model performance with respect to project goals is also important. The Mn/Model goal, in Phase 1, was to develop a model that placed 85 percent of the known sites in the high and medium probability areas and that these should be no more than 66 percent of the landscape. Based on the performance of the Phase 1 models, this goal was raised, in Phases 2 and 3, to place 85 percent of the known sites in high and medium probability areas, which should be no more than 33 percent of the landscape.

Model evaluation was performed for several sets of data points. Ideally, each model should be evaluated with data that were not used in its construction (it is assumed that models will predict the training data). In Phases 1 and 2, and in the first stages of Phase 3, groups of known sites were set aside for model evaluation. These testing data were not used in developing the models. Therefore, evaluation could be conducted by looking at how many known sites, from the test group, occurred in areas predicted to be high, medium, and low probability. Other sets of data points evaluated included all known sites (both the training and testing populations), all random points, and all negative survey points.

In Phase 3, an additional evaluation was performed to determine model stability. This procedure involved developing models from different halves of the database and analyzing how many cells were assigned the same probability values in both models. A Kappa statistic was calculated from the results of this analysis (Chapter 7). This evaluation provided additional evidence of the detrimental effects of low site numbers on model performance and preceded the development of models based on the entire database.

Tables in Chapter 8 present the result of these evaluations for each of the Phase 3 models. These do not include evaluations of the testing data, as these data were combined with the training data in the last round of Phase 3 modeling. Refer to Chapter 7 for a complete discussion of the modeling and evaluation procedures.

4.7.3.5 Refining Models

There are a variety of reasons for continuing to refine modeling procedures. The initial models may not achieve project goals. Models may contain anomalies or patterns that indicate underlying problems with data or methods. Models may prove to be uninterpretable. Or opportunities, such as newly available data, may prompt a new round of modeling. In this project, meetings with project advisors provided the modeling team with valuable feedback and infusions of new ideas for improving model performance. Refining models may involve incorporating more or better data, correcting errors in base data or variables, modifying procedures for deriving variables, or modifying modeling procedures. In Mn/Model, refinements made in each project phase produced measurable improvements in the quality of the models produced (see Chapter 8).

Return to Top

4.8 Conclusion

GIS design for a large project such as this must consider a number of factors. The GIS standards of the end user must be met. Data and model resolution must be fine enough to detect environmental variations related to archaeological site location, but not finer than the available data will support. Appropriate hardware, software, and experienced staff must be available to conduct data conversion and modeling activities. A significant portion of the necessary data must be available in digital format. Consistent, efficient procedures and quality control mechanisms must be established. These should include strategies for managing hard drive space and processor use. All procedures should be extensively documented, and all critical data layers must be archived, preferably in more than one place and more than one format. The modeling team must maintain flexibility and creativity to modify procedures based on observed results or feedback from others.

A large part of this project has been management and conversion of GIS data. Developing procedures for these tasks has been as important as developing modeling procedures. Creating and maintaining the GIS database supports all other activities. This database can also be used on many kinds of future planning projects for MnDOT or other agencies, reducing the cost and improving the accuracy of those projects.

Finally, models must be developed in such a way that they can be used by the lead agency and others. Modeling goals appropriate to the proposed application must be established, procedures designed to measure progress towards those goals, and the resultant models presented in a way that will be interpretable by the end user. Chapters 9 and 11 discuss the implementation and application of Mn/Model to MnDOT projects.

Return to Top

REFERENCES

Anderson, P.F.
1995 GIS Modeling of Archaeological Sites in the Raccoon River Greenbelt. CRM report prepared
for the Dallas County Conservation Board, Iowa.

Anfinson, S.F.
   1990 Archaeological Regions in Minnesota and the Woodland Period. In The Woodland Tradition
       in the Western Great Lakes: Papers Presented to Elden Johnson, edited by G. E. Gibbon,
       pp. 135-166. University of Minnesota Publications in Anthropology No. 4. Department of
       Anthropology, University of Minnesota, Minneapolis.

Bannister, P.
1976 Introduction to Physiological Plant Ecology. John Wiley and Sons, New York.

Bonham-Carter, G.F.
1994 Geographic Information Systems for Geoscientists: Modelling with GIS. Pergamon
(Elsevier Science Ltd), New York.

Brady, N.C.
   1974 The Nature and Properties of Soils, 8^th Edition. Macmillan, New York. Carmichael, D.L.
   1990 GIS predictive modeling of prehistoric site distributions in central Montana. In Interpreting
       Space: GIS and Archaeology, edited by K.M.S. Allen, S.W. Green, and E.B.W. Zubrow, pp.
       216-225. Taylor and Francis, New York.

Cialek, C.
1993 Standards Flexibility Creates Cooperation, Government Technology, September 1993.

Cooper, A.W.
1961 Relationships between Plant Life-Forms and Microclimate in Southeastern Michigan.
Ecological Monographs 31(1): 31-59.

Dalla Bona, L.
   1994 Cultural Heritage Resource Predictive Modelling Project: Volume 4, A Predictive
       Model of Prehistoric Activity Location for Thunder Bay District, Ontario. Ontario Ministry
       of Natural Resources, Toronto.

DeMers, M.N.
   1997 Fundamentals of Geographic Information Systems. John Wiley and Sons, New York.
       ESRI
   1992 Map Projections in ARC/INFO. In ARC News, Fall 1992: 6-7. ESRI, Redlands, California.
       Estes, J.E., D.S. Simonett, et al.
   1975 Fundamentals of Image Interpretation. In Manual of Remote Sensing Volume II:
       Interpretation and Applications, 1^st ed., edited by R.G. Reeves, A. Anson, and D. Landen,
       pp. 869-1076. American Society of Photogrammetry, Falls Church, Virginia.

Etherington, J.R.
1975 Environment and Plant Ecology. John Wiley and Sons, New York.

Eyre, S.R.
   1968 Vegetation and Soils: A World Picture, Second Edition. Edward Arnold, London. Federal
       Geographic Data Committee
   1997 Framework Introduction and Guide. Federal Geographic Data Committee. Washington,
       D.C. Fisher, P.F.
   1991 Spatial Data Sources and Data Problems. In Geographical Information Systems, Volume
       1: Principles, edited by D.J. Maguire, M.F. Goodchild, and D.W. Rhind, pp. 175-189.
       Longman Scientific and Technical with John Wiley and Sons, New York.

Grimm, E.C.
   1981 An Ecological and Paleoecological Study of the Vegetation in the Big Woods Region of
       Minnesota. Unpublished PH.D. dissertation, University of Minnesota, Minneapolis.
   1983 Chronology and Dynamics of Vegetation Change in the Prairie-Woodland Region of Southern
       Minnesota, U.S.A. New Phytologist 93: 311-350.
   1984 Fire and Other Factors Controlling the Big Woods Vegetation of Minnesota in the Mid-
       Nineteenth Century. Ecological Monographs 54: 291-311.

Hanson, D.H. and B.C. Hargrave.
1996 Development of a Multilevel Ecological Classification System for the State of Minnesota.
Environmental Monitoring and Assessment 39:75-84. Kluwer Academic Publishers. Netherlands.

Hasenstab, R.
1991 Wetlands as a Critical Variable in Predictive Modeling of Prehistoric Site Locations: A Case
Study from the Passaic River Basin. Man in the Northeast 42:39-61.

Healy, R.G.
   1991 Database Management Systems. In Geographical Information Systems, Volume 1:
       Principles, edited by DJ Maguire, M.F. Goodchild, and D.W. Rhind, pp. 251-267. Longman
       Scientific and Technical with John Wiley and Sons, New York.

Kohler, T.A. and S.C. Parker
   1986 Predictive Models for Archaeological Resource Location. In Advances in Archaeological
       Method and Theory, vol. 9, edited by M.B. Schiffer, pp. 397-452. Academic Press, New
       York.

Kvamme, K.L.
   1988 Development and Testing of Quantitative Models. In Quantifying the Present and
       Predicting the Past: Theory, Method, and Applications of Archaeological Predictive
       Modeling, edited by W.J. Judge and L. Sebastian, pp. 325-428. U.S. Department of the
       Interior, Bureau of Land Management, Denver, Colorado.
   1990 The Fundamental Principles and Practice of Predictive Archaeological Modeling. In Studies
       in Modern Archaeology Volume 3, Mathematics and Information Science in Archaeology:
       A Flexible Framework, edited by A. Voorrips, pp. 257-295. Holos, Bonn.
   1992 A Predictive Site Location Model on the High Plains: An Example with an Independent Test.
      Plains Anthropologist 37(138): 19-40.
   1992 A Predictive Site Location Model on the High Plains: An Example with an Independent Test.
       Plains Anthropologist 37: pp. 19-40.

Lafferty, R.H. III, J.L. Otinger, SC Scholtz, W.F. Limp, B. Watkins, and R.D. Jones
   1981 Settlement Predictions in Sparta: A Location Analysis and Cultural Resource Assessment of the
       Uplands of Calhoun County, Arkansas. Arkansas Archaeological Survey Research Series 14,
       W.F. Limp, ed. Fayetteville, Arkansas.

Maguire, DJ, M.F. Goodchild, and D.W. Rhind, eds.
1991 Geographical Information Systems, Volume 1: Principles. Longman Scientific and Technical
with John Wiley and Sons, New York.

Maling, DH
   1991 Coordinate Systems and Map Projections for GIS. In Geographical Information Systems,
       Volume 1: Principles, edited by DJ Maguire, M.F. Goodchild, and D.W. Rhind, pp. 135-146.
       Longman Scientific and Technical with John Wiley and Sons, New York.

Marschner, F.J.
   1974 The Original Vegetation of Minnesota. Compiled from U.S. General Land Office Survey
       notes. U.S. Department of Agriculture, Forest Service, North Central Forest Experiment Station,
       St. Paul, Minnesota.

McIntosh, R.P.
1967 The Continuum Concept of Vegetation. The Botanical Review 33: 130-187.

Minnesota Governor’s Council on Geographic Information GIS Standards Committee.
   1996 Minnesota Geographic Metadata Guidelines, Version 1.0. Minnesota State Planning
       Agency, Land Management Information Center, St. Paul, Minnesota.
   1998 Implementing the National Standard for Spatial Data Accuracy (Draft). Prepared for the
       8^th annual Minnesota GIS/LIS Consortium Conference. Ms. on file, Minnesota State Planning
       Agency, Land Management Information Center, St. Paul, Minnesota.

Minnesota State Planning Agency, Land Management Information Center
   1989 Data Documentation for 40-Acre and 100-Meter Data for Use with EPPL7 on IBM PCs, or
       Compatibles. Ms. on file, Land Management Information Center, Minnesota State Planning
       Agency, St. Paul, Minnesota.

Montgomery, G.E. and H.C. Schuch
   1993 GIS Data Conversion Handbook. GIS World Books, Fort Collins, Colorado. Mooney, H.A.
   1974 Plant Forms in Relation to Environment. In Handbook of Vegetation Science, Part VI:
       Vegetation and Environment, edited by B.R. Strain and W.D. Billings, pp. 113-122. Dr. W.
       Junk b.v., The Hague.

Muller, J.-C.
   1991 Generalization of Spatial Databases. In Geographical Information Systems, Volume 1:
       Principles, edited by DJ Maguire, M.F. Goodchild, and D.W. Rhind, pp. 457-475. Longman
       Scientific and Technical with John Wiley and Sons, New York.

O’Neill, R.V., D.L. DeAngelis, J.B. Waide, and T.F.H. Allen
1986 A Hierarchical Concept of Ecosystems. Monographs in Population Biology 23, Princeton
University Press, Princeton, New Jersey.

Reeves, R.G., compiler
   1975 Glossary. In Manual of Remote Sensing Volume II: Interpretation and Applications, 1^st ed.,
       edited by R.G. Reeves, A. Anson, and D. Landen, pp. 869-1076. American Society of
       Photogrammetry, Falls Church, Virginia.

Santos, K.M. and J.E. Gauster
   1993 User’s Guide to National Wetlands Inventory Maps (Region 3) and to Classification of
       Wetlands and Deepwater Habitats of the United States. U.S. Fish and Wildlife Service,
       National Wetlands Inventory, Region 3, 4101 East 80^th Street, Bloomington, Minnesota.

Scott, D.
   1974 Description of Relationships between Plants and Environment. In Handbook of Vegetation
       Science, Part VI: Vegetation and Environment, edited by B.R. Strain and W.D. Billings, pp.
       49-69. Dr. W. Junk b.v., The Hague.

Scott, J.T.
   1974 Correlation of Vegetation with Environment: A Test of the Continuum and Community-Type
       Hypotheses. In Handbook of Vegetation Science, Part VI: Vegetation and Environment,
       edited by B.R. Strain and W.D. Billings, pp. 89-109. Dr. W. Junk b.v., The Hague.

Snyder, J.P. and P.M. Voxland
1989 An Album of Map Projections. U.S. Geological Survey Professional Paper 1453. United
States Government Printing Office, Washington, D.C.

Star, J. and J. Estes.
1990 Geographic Information Systems: An Introduction. Prentice Hall, Englewood Cliffs, New
Jersey.

Strahler, A.N.
1975 Physical Geography, 4^th Edition. John Wiley and Sons, New York.

The P118 Project Team
1997 Geographic Information Systems (GIS) Technology Standards (P118) Final Report. Ms. on file
Minnesota Department of Transportation, St. Paul, MN.

Trygg, J.W.
1964-1969 Composite Map of the United States Lane Surveyors Original Plats and Field Notes
(for Minnesota). Published by Trygg Land Company, Ely, Minnesota.

University of Minnesota, Department of Soil Science
1970-1976 Minnesota Soil Atlas (1:250,000 map sheets). University of Minnesota Agricultural
Experiment Station, St. Paul.

Warren, R.E.
   1990a Predictive Modelling of Archaeological Site Location: A Case Study in the Midwest. In
       Interpreting Space: GIS and Archaeology, edited by K.M.S. Allen, S.W. Green, and E.B.W.
       Zubrow, pp. 201-215. Taylor and Francis, New York.
   1990b Predictive Modelling in Archaeology: a Primer In Interpreting Space: GIS and Archaeology,
       edited by K.M.S. Allen, S.W. Green, and E.B.W. Zubrow, pp. 90-111. Taylor and Francis, New
       York.

Warren, R.E. and D.L. Asch
   1996 A Predictive Model of Archaeological Site Location in the Eastern Prairie Peninsula, Illinois.
       Illinois State Museum (preliminary, unpublished manuscript, scheduled to come out in July 1999,
       In 'Practical applications of GIS for archeologists: a predictive modeling toolkit, edited by
       K. L. Wescott and R. Joe Brandon. Taylor & Francis, London).

Williams, I., W.F. Limp, and F.L. Briuer
   1990 Using Geographic Information Systems and Exploratory Data Analysis for Archaeological Site
       Classification and Analysis. In Interpreting Space: GIS and Archaeology, edited by K.M.S.
       Allen, S.W. Green, and E.B.W. Zubrow, pp. 239-273. Taylor and Francis, New York.

Wise, C.L.
   1981 A Predictive Model Study for Cultural Resource Reconnaissance for the Central and
       Lower Passaic River Basin. Soil Systems, Inc., Atlanta. Submitted to the New York District,
       Army Corps of Engineers, New York (as cited in Hasenstab 1991).

Young, P.M., M.R. Horne, C.D. Varley, P.J. Racher, and A.J. Clish.
1995 A Biophysical Model for Prehistoric Archaeological Sites in Southern Ontario. Research and
Development Branch, Ministry of Transportation, Ontario.

Return to Top

The Mn/Model Final Report (Phases 1-3) is available on CD-ROM. Copies may be requested by visiting the contact page.

MnModel Orange Bar Logo

Acknowledgements

MnModel was financed with Transportation Enhancement and State Planning and Research funds from the Federal Highway Administration and a Minnesota Department of Transportation match.

Copyright Notice

The MnModel process and the predictive models it produced are copyrighted by the Minnesota Department of Transportation (MnDOT), 2000. They may not be used without MnDOT's consent.