Chapter 11: Systematic Observational Methods and Naturalistic Research
Systematic observational methods require a replicable system for assigning values to observed events. Such observation can take place in the laboratory, but it tends to occur in the field.
Behavior is natural if it is unselfconscious. Treatment is natural if it would have occurred without the experiment and the participant is unaware. A setting is natural if it’s not perceived to have been established for the purpose of conducting research. Naturalness places no constraints on behaviors and does not impinge on the environment.
When an observer “goes native” to gain entry into a group, he runs the risk of over-identifying with its members. He may become blind to aspects of the situation or draw attention only to events perceived to be of interest to the group. However, going native is useful in exploratory, hypothesis-generating situations, in which great amounts of “rich,” if not necessarily reliable, information are needed.
When coding behaviors, one may opt for complete systemization or for less structured observation. Experimenter expectancy bias from early data returns can be avoided by not conducting a detailed inspection of results until all data are collected. Generalizability is difficult.
A category is a description of a behavior and can be as simple as a count of the number of times a specific event occurs; in this case, degree of intensity is usually not a concern. Observations may be timed (when and how long). The unit issue involves the troubling task of determining when one behavior stopped and another began.
When constructing category systems, consider whether you want to know the intensity or the extensity of a particular behavior. Also give careful consideration to the number of categories, whether you want dimensions instead of classifications, and whether you want coders to use inference. (Should the system deal only with observable events, or call for an explanation of possible motives?)
When coding behaviors, Weick (1985) outlined four types of comparisons for reliability: (1) two people observe the event at the same time – Cohen’s kappa assesses their agreement, corrected for chance; (2) one person observes similar events at two different times; (3) two people observe at two different times (lowest reliability); (4) one person observes one event and it is checked for internal consistency in a manner similar to odd-even item correlations in a test.
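As a sketch, comparison type (1) can be quantified like this; the behavior categories and coder assignments below are hypothetical:

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Chance-corrected agreement between two coders' category assignments."""
    n = len(coder_a)
    # Proportion of events on which the two coders agree.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Agreement expected if each coder assigned categories independently,
    # at the marginal rates actually observed.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical codings of six observed events into three behavior categories.
a = ["play", "play", "fight", "rest", "play", "rest"]
b = ["play", "fight", "fight", "rest", "play", "rest"]
print(round(cohens_kappa(a, b), 2))  # 0.75
```

Kappa of 1.0 means perfect agreement; 0 means agreement no better than chance.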
Chapter 12: Interviewing
Interviews take more time and effort than written questions, but they are useful in getting sensitive or detailed information. They also bridge the written communication gap with certain populations (children, elderly, and ESL speakers) and can help achieve higher response rates.
Interviews may be conducted in-person or over the phone – each has its advantages. In-person will provide a deeper answer set, with nonverbal cues; however, they are time consuming and expensive. Telephone interviews can be done on a tighter budget, but interviewers are not as equipped to detect and correct participant confusion.
General guidelines for interview questions: keep them brief, ask directly, avoid double-barreled questions, and use simple language. Every question has an associated dropout rate, so only ask questions you need. For instance, questions about household income tend to have a 5-10% refusal rate, which could be the source of a systematic error. Use interview questions in a pre-test on a small sample of the population to uncover any problems that may not have been foreseeable.
Socio-demographic questions are the most common; answers to questions probing race, ethnicity, and income are generally trustworthy.
Asking for a reconstruction of events can be difficult, as many events are not important enough to be remembered. Recall will be better if the event (1) is unique; (2) has a large social impact; (3) has long-term, continuing consequences. When an interviewer asks for a reconstruction of events, he should stress the importance of complete answers and should use longer questions. Typically in interviews, shorter questions are better, but in the instance when accuracy of recall is at stake, longer questions evoke longer answers.
When asking questions to assess attitudes, bear in mind that an attitude is an evaluative belief and it may or may not carry any behavioral implications. When asking attitude scale questions, there are two relevant aspects of the questions – constraint and specificity. With constraint, if you’re surveying a population in which most people have an opinion on an issue but may not be firm about it, do not include “don’t know” in the limited set of allowable responses because it will result in an underrepresentation of opinions held. With specificity, note that subtle changes in wording (such as “forbid” versus “not allow”) can have major implications.
When asking questions to determine private beliefs and actions, underreporting is a major concern. Try to secure a participant’s commitment to the interview by assuring him that results are not traceable to him. Provide reinforcement and feedback during the course of the interview. One way around underreporting: pair an innocuous question with a sensitive question, and ask the respondent to flip a coin to decide which question to answer, without disclosing which was chosen. Another method: provide two lists of behaviors, one containing only innocuous behaviors and one containing the same innocuous behaviors plus a single sensitive one. The difference in the mean number of behaviors reported by the two groups estimates the prevalence of the sensitive behavior in the population. Reporting an overall number is less intrusive than confirming or denying each behavior.
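The two unobtrusive-questioning estimators above can be sketched as follows (all numbers are hypothetical; the coin-flip version assumes a fair coin and an innocuous question with a known base rate):

```python
def randomized_response_estimate(p_yes, p_innocuous, p_coin=0.5):
    """Coin-flip version: each respondent secretly flips a coin; heads means
    answer the innocuous question, tails means answer the sensitive one.
    Overall: p_yes = p_coin * p_innocuous + (1 - p_coin) * p_sensitive."""
    return (p_yes - p_coin * p_innocuous) / (1 - p_coin)

def list_experiment_estimate(mean_with_sensitive, mean_innocuous_only):
    """List version: the difference in mean number of endorsed behaviors
    between the two lists estimates the sensitive behavior's prevalence."""
    return mean_with_sensitive - mean_innocuous_only

# Hypothetical data: 40% answered "yes" overall; the innocuous question
# (e.g., "Is your birthday in January-June?") has a known base rate of .50.
print(round(randomized_response_estimate(0.40, 0.50), 2))  # -> 0.3
print(round(list_experiment_estimate(2.1, 1.8), 2))        # -> 0.3
```

Neither method reveals any individual’s answer to the sensitive question; only an aggregate prevalence is recovered.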
An exploratory interview is a nonstructured, free-response observational investigation. It takes a skilled researcher and can be expensive. In the structured-nonscheduled (no questionnaire) interview, the interviewer must obtain certain highly specified information, but there is no specification of the manner in which it is to be obtained (no preset questions). It, too, is expensive to conduct and assumes some prior form of exploratory investigation. The structured-scheduled interview is the cheapest of the three, though there are still restrictions on allowable interviewer behaviors. Closed questions are more common and offer the greatest degree of standardization when the list of possible responses is complete.
When establishing rapport, it is best to match physical appearance, dress, accent, apparent socioeconomic status, ethnic heritage, and most especially race. For longer-term interactions, match other social characteristics (lifestyle). The researcher should show enthusiasm for the research, professionalism, friendliness, and interest in the respondent. These factors can’t be directly controlled but depend on the experience and training of interview personnel. It is best to ask the least threatening questions first. Avoid leading questions unless there is a clear reason for using one (e.g., to indicate the interviewer was paying attention). Silence, used appropriately by the interviewer, will encourage the interviewee to speak.
Chapter 13: Content Analysis
Content analysis is the study of communication materials, specifically the “what” (content) and the “how” (how it is delivered). The data in content analysis are similar to interview data—qualitative and unstructured. Typically, the researcher is concerned with a communication that (1) was not elicited by some systematic set of questions chosen by the analyst; (2) does not contain all the information he would like it to contain; (3) is almost invariably stated in a manner not easily codified and analyzed.
Preliminary considerations for content analysis: ensure the method is compatible with the ultimate goal of the research, find a body of content, develop sampling rules, and define the type of coding system. The researcher should practice using the system on a sample (pilot), assess reliability, hire and train coders, code the material, check reliability again, then analyze. Coding units should be determined beforehand; sometimes, however, coding categories emerge empirically during the course of the analysis (not advisable, due to costs).
A coding unit is the specific unit to be classified. Coding units can be a word (simplest, but of limited utility), a theme or assertion (usually a simple sentence derived from a more complex context), an item (e.g., a news story or editorial), or a character (a specific individual or personality type).
A context unit is the body of content within which a coding unit’s meaning is to be inferred—it is where you look to interpret the unit, such as the meaning inferred from an item’s appearing on the front page of a newspaper.
In thematic analyses, the analysis makes use of both themes and words as coding units, which provides more information than analyses based on words alone. Example: suicide note analysis (real vs. fake). Sometimes spatial characteristics of the content (inches of newspaper column, length of article) and temporal characteristics of audio or visual communication (minutes of time devoted to a topic) are used; these assess the amount of attention the media are paying to a topic.
Sampling is usually a multistage operation. The researcher must define the universe of content, sources to find data, the extent of content to be investigated, and the time period to be used. Three key decisions: (1) coding scheme; (2) appropriate unit of analysis; (3) manner in which units will be sampled.
The reliability of coding has three key components. (1) Stability: the extent to which the same coder assigns identical scores to the same content at different times (akin to test-retest reliability). (2) Reproducibility: the extent to which the outcome can be replicated by different coders (inter-coder agreement). (3) Accuracy: the extent to which coding conforms to a known standard (most important).
If the goal is descriptive, analyst will present relative frequencies. If more than descriptive (inferential content analysis), the study’s potential value is increased but so are the risks of faulty generalizations. This should only be attempted after the validity of such inferences has been tested.
Chapter 14: Scaling Stimuli
The term “psychometric quality” refers to the validity and reliability of a scale or measure. Scales of individuals measure individuals’ feelings about an underlying construct. Stimulus scales measure perceived differences among stimuli; in this instance, there is a single perceived answer, and differences between individual answers are considered “error.” Stimulus scales give a clear picture of a group’s preferences and suggest the degree of unanimity.
Scaling stimuli provides equal-interval scales, which are rare in social science research. There are two different ways to create this measurement. (1) Pair comparison: using a group of stimuli, each stimulus is compared against all others. (2) Rank order: a group of stimuli are ordered based on a dimension or quality.
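As an illustration of how pair-comparison data can yield an interval scale, here is a minimal sketch in the spirit of Thurstone’s Case V (a common scaling model, not named in these notes; the preference proportions are hypothetical):

```python
from statistics import NormalDist

# prop[i][j] = proportion of judges who preferred stimulus j over stimulus i
# for three hypothetical stimuli A, B, C.
prop = [
    [0.50, 0.70, 0.90],
    [0.30, 0.50, 0.80],
    [0.10, 0.20, 0.50],
]

inv = NormalDist().inv_cdf
n = len(prop)
# Each stimulus's scale value is the mean z-score of its column: how far, in
# normal-deviate units, it is preferred over the other stimuli on average.
scale = [sum(inv(prop[i][j]) for i in range(n)) / n for j in range(n)]
print([round(s, 2) for s in scale])
```

The resulting values form an equal-interval (not ratio) scale: differences between stimuli are meaningful, but the zero point is arbitrary.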
The key assumption in scaling operations is that the data are transitive (if A>B and B>C, then A>C). Intransitive choices can occur when (a) the stimuli are too similar or (b) the scale along which stimuli are judged is multidimensional.
If too many intransitive choices occur, there are four likely areas where the issues may arise: (1) Participants’ familiarity with the stimulus dimension (unfamiliar with some choices); (2) Definitional specificity of the choice dimension (choice dimension not clearly specified); (3) Differentiability of stimuli (stimuli are too similar); (4) Dimensionality or equivocality of the choice dimension.
The appropriate circumstances for an accurate and concise summary of the judgments of a group are as follows: a set of stimuli that are clearly discriminable; stimuli that the respondent sample is familiar with; stimuli that are well-defined; unidimensional choice dimension; reasonable number of stimuli (too many and participants get bored or drop out).
Note: interval scales do not allow for absolute judgments or judgments that involve ratios (e.g., “Hopkins is twice as good as Pitt”). Only ratio-level data (with a meaningful zero) can support such claims.
Rank order can be used to avoid problems with pair comparison. In this instance, there is no intransitivity. Therefore, if participants rank A>B and B >C, then A must be ranked > C. Respondent boredom or fatigue will result if the list is too long.
Rank order and pair comparison are similar in two ways: (1) Comparative judgments: provide info regarding differences among stimuli, and (2) respondent differences are considered error (respondents are viewed as replicates). Assumption 2 is necessary to justify pooling responses over participants.
Multidimensional scaling models add dimensions to reflect the complexity of an issue. For example, judging actors on the single dimension of “acting ability” oversimplifies the complex judgment.
The unfolding technique (Coombs, 1964) bridges stimulus scaling and individual-differences scaling. For example, on a horizontal line, individuals are scaled in order on the top half, with stimuli scaled on the bottom half. This is quite complicated.
Chapter 15: Scaling Individuals
Questionnaires are short in order to save money and to avoid participant fatigue (which can lead to bad data). They can be in interview or written form. There are no formal rules, but there are rules of thumb: (1) Ask direct questions in simple sentences, avoid double-barreled questions, and don’t ask for personal info unless necessary; (2) Open-ended questions are more sensitive but are harder to analyze; (3) Question order: least threatening questions first, rotate order (within blocks, not the whole questionnaire); (4) Dropout and “no opinion” responses: decide in advance how these will be handled; it may be necessary to drop all data for that participant.
Rating scales are formal versions of questionnaires, usually designed to measure one specific attitude or value. Stimuli are called items and are often a statement the participant should endorse or reject.
Thurstone’s Method of Equal Appearing Intervals asks the respondent to endorse the items with which he agrees. After an item is endorsed, items more and less extreme should be rejected. This is a non-monotone (noncumulative) format, so we do not sum a respondent’s scores; agreement with one item does not imply a probability of agreement with another item. This scale is difficult to develop. There are four steps: (1) the researcher generates potential items, covering the range of possible evaluations; (2) judges independently estimate each item’s degree of favorability or unfavorability; (3) the mean and standard deviation of the judges’ ratings are computed for each item, and items with high standard deviations are discarded; (4) items (15-25) are chosen so the scale values cover the range of opinions. The researcher instructs participants to “read all items and choose the two that best express your feelings on the topic.” The average scale value of the items chosen reflects the individual’s attitude. Internal consistency is not meaningful because participants do not respond to all items. Test-retest reliability should be considered, but it is likely to be low. Objections: the method forces the investigator to employ costly techniques to estimate reliability (and these are susceptible to failure); it is unclear whether a judge can truly disregard personal feelings; it is difficult and time consuming; and it does not take advantage of technology.
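The construction and scoring steps might be sketched like this (judge ratings, item names, and the discard cutoff are hypothetical):

```python
from statistics import mean, stdev

# Step 2: each item is rated by judges on an 11-point favorability continuum.
judge_ratings = {
    "item_a": [2, 3, 2, 3, 2],    # consistently unfavorable
    "item_b": [6, 5, 6, 5, 6],    # consistently neutral
    "item_c": [1, 6, 10, 3, 9],   # judges disagree -> ambiguous item
}

# Step 3: scale value = mean judge rating; discard high-sd (ambiguous) items.
MAX_SD = 2.0  # hypothetical cutoff
scale_values = {item: mean(r) for item, r in judge_ratings.items()
                if stdev(r) <= MAX_SD}

def score(endorsed_items):
    """A respondent's attitude = mean scale value of the items he endorsed."""
    return mean(scale_values[i] for i in endorsed_items)

print(sorted(scale_values))            # item_c was discarded
print(score(["item_a", "item_b"]))     # attitude of one hypothetical respondent
```

Note that the judges’ ratings serve only to calibrate the items; the respondents’ attitudes come from which items they endorse.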
Guttman’s Scalogram Analysis is not very frequently used. (Crano has seen this scale once in his career.) It makes use of the concept of cumulative (monotone) items: the more extreme a respondent’s score, the more extreme his attitude. If the scale is of high quality, endorsing an extreme item should imply endorsement of the less extreme items. If an investigator can reconstruct the specific set of alternatives that were chosen by knowing a respondent’s total score, the scale is said to possess a high coefficient of reproducibility. To calculate it we need the total number of responses made by the total sample of respondents and the number of times participants’ choices fell outside the predicted pattern of responses; the coefficient of reproducibility is 1 minus (total errors divided by total responses). It is difficult to establish an internal consistency reliability coefficient, so the method is underutilized.
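A minimal sketch of the coefficient of reproducibility, using one common convention for counting errors (deviations from the ideal cumulative pattern implied by each total score); the response patterns are hypothetical:

```python
def reproducibility(response_patterns, n_items):
    """CR = 1 - (total errors / total responses). Items in each pattern are
    assumed ordered from least to most extreme; on a perfect Guttman scale,
    a total score of k means endorsing exactly the k least extreme items."""
    total_responses = len(response_patterns) * n_items
    errors = 0
    for pattern in response_patterns:          # pattern: sequence of 0/1
        k = sum(pattern)                       # respondent's total score
        ideal = [1] * k + [0] * (n_items - k)  # perfectly cumulative pattern
        errors += sum(p != q for p, q in zip(pattern, ideal))
    return 1 - errors / total_responses

# Two perfectly cumulative patterns and one with two deviations.
patterns = [(1, 1, 0, 0), (1, 1, 1, 0), (1, 0, 1, 0)]
print(reproducibility(patterns, 4))
```

Conventionally, a CR of about .90 or higher is taken to indicate a scalable (cumulative) item set.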
Likert’s method of summated ratings is more efficient (time/resource expenditure) and effective (reliability: both internal consistency and temporal stability) than the previous two methods. Participants indicate the extent to which they agree with a position (SA, A, N, D, SD); their attitude is the sum of their responses. Statements can be positively or negatively worded. A single item is a fallible indicator of the underlying cognitive construct; we minimize error by having each item contribute to an overall score (akin to multiple operationalization). Steps to creating a Likert scale: (1) collect items moderately favorable or unfavorable toward the object under study; (2) administer the item set to a group of respondents (the number of respondents should be 5 to 10 times the number of items); (3) score the items and sum them for each participant, creating a total score for each person; (4) calculate the matrix of inter-correlations between all pairs of items and between each item and the total score, and use each item’s correlation with the total score to identify and remove weakly correlated items; (5) recalculate the item-total correlations of this reduced set and recalculate coefficient alpha (about 0.75 is reasonably accurate; some shrinkage is expected when administered to a different sample; if weak, add more items that correlate positively with the original set; if too many items are needed, the scale is most likely multidimensional); (6) administer to a new set of participants, calculate reliability, and (if good) compute summed scale scores for each respondent. There are typically five response options; 7-point scales might be better. The disadvantage is the time and effort of the scale construction process: it requires the development of a new set of items each time participants’ attitudes toward a new object are to be assessed.
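Steps (4) and (5) might be sketched as follows (the 5-respondent × 3-item response matrix is hypothetical, and Pearson correlation is hand-rolled to keep the sketch self-contained):

```python
from statistics import mean, variance

def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# rows = respondents, columns = items scored 1 (SD) to 5 (SA)
data = [
    [5, 4, 5],
    [4, 4, 5],
    [2, 1, 2],
    [1, 2, 1],
    [3, 3, 3],
]
n_items = len(data[0])
totals = [sum(row) for row in data]

# Step 4: item-total correlations; low or negative items are removal candidates.
item_total = [pearson([row[j] for row in data], totals) for j in range(n_items)]

# Step 5: coefficient alpha for the item set.
sum_item_vars = sum(variance([row[j] for row in data]) for j in range(n_items))
alpha = (n_items / (n_items - 1)) * (1 - sum_item_vars / variance(totals))
print([round(r, 2) for r in item_total], round(alpha, 3))
```

In practice, negatively worded items must be reverse-scored before this analysis, or their item-total correlations will come out negative.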
Osgood’s Semantic Differential Technique uses seven-point scales anchored by bipolar adjectives. The respondent’s attitude is the sum of his scores. Scales that cluster together focus on the same construct: Evaluation (good/bad), Potency (strong/weak), Activity (active/passive). The semantic differential technique offers the researcher a ready-made attitude scale; the disadvantage is that common sense is needed when choosing the specific evaluative scales. Indirect approaches: the investigator interprets the responses in categories different from those the respondent had in mind while answering. Misdirection should lower defenses and yield a more valid picture of attitudes, rather than the participant showing himself only in a favorable light. Indirect techniques have been developed, but their success has been limited: sentence completion, apperception tests (the participant generates a story about a picture he views), and methods of coding the integrative complexity of political leaders. Disadvantages: demands on the time and technical expertise of the investigator, and the need for inter-rater reliability. Advantage: they can provide valuable insight into underlying processes.
In summation: with Thurstone, reliability is difficult to assess; Likert is the most widely used and has good reliability.
Chapter 16: Social Cognition Methods
Because inner experiences (personal feelings and mental life) are not directly observable, social researchers must often rely on people’s introspective reports of their private experience to acquire data amenable to recording and quantification. However, self-reports are problematic. Subjects may adjust responses to meet personal standards, social standards, or social expectations, due to evaluation apprehension and social desirability concerns, especially when issues are embarrassing, sensitive, or political. Participants may also be unable to report accurately if they lack conscious access to the mental processes that underlie their behaviors or decisions.
Attention (what information is taken in) is a limited resource, selectively distributed among stimuli. We also need to know about encoding (how that information is understood and interpreted at intake), storage (how information is retained in long-term memory), and retrieval (what information is accessible in memory). Measuring how long a person attends to a stimulus tells us about its importance; measures include visual attention, inference, and processing time.
Memory is assessed by: (1) recall measures (free or cued) – look at quantity, accuracy, sequence; (2) recognition measures (review each item and determine if it was introduced or if it’s new info).
Priming is the unintended influence that recent experiences have on thoughts, feelings, and behaviors. In concept priming, participants are primed with words (e.g., words related to old age) and the prime leaks into the assigned task. In supraliminal priming, participants are consciously aware of the priming stimuli, though they are unaware of their purpose. Subliminal priming occurs below awareness (e.g., a stimulus flashed on screen too briefly to be consciously perceived). This can involve foveal processing (the stimulus is flashed at the focal point) or parafoveal processing (the stimulus is flashed at the edge of the visual field). In sequential priming, a stimulus is associated with a concept/feeling, so presenting that stimulus will automatically activate (prime) those associations.
Two techniques reduce the influence of intentional processing: (1) Cognitive busyness: requiring participants to hold an eight-digit number in memory while engaged in a judgment or other task occupies controlled processing (and the manipulation’s effectiveness can be checked). (2) Response interference methods: when responses are automatically elicited, they can interfere with the production of incompatible responses. Examples: the Stroop effect (a color-naming task in which, e.g., the word “green” appears in red text); the Implicit Association Test (two buttons, each assigned a value such as good/bad; measure the time it takes to make a judgment).
Physiological responses (respiration, pulse, finger temperature, Galvanic skin response) can distinguish between motivational states, positive & negative affect, attention, and active cognitive processing. Heart rate alone is an inadequate marker of a specific arousal state, but combined with other measures of cardiac and vascular performance, anticipation and stress can be identified. Specific patterns of cardiovascular response can distinguish between feelings of threat versus challenge as motivational states in anticipation of potentially difficult or stressful situations
Facial action coding system (FACS) assesses emotional states based on spontaneous facial expressions. Drawbacks: extensive training required and as it is possible to control facial response, it isn’t always a measure of affect.
Other techniques include: (1) Facial electromyograph (EMG): overt facial expressions are potentially controllable, but the tiny, visually imperceptible movements of specific facial muscles are not. EMG measures focus on specific muscles of the eyes and mouth associated with frowning and smiling; activity of these muscles is not overtly controllable and reveals the underlying affective state. (2) Startle eye-blink reflex: the reflexive blinks that occur when individuals perceive an unexpected, relatively intense stimulus such as a loud noise. The eye-blink response has been shown to be enhanced if the perceiver is in a negative affective state and inhibited if in a positive one. (3) Measures of brain activity: EEG and neuroimaging.
In conclusion, triangulation of results is important. Each measure has its own strengths and weaknesses; using multiple approaches reinforces and validates understanding.
Chapter 18: Meta-analysis
We can understand the structure of a phenomenon by studying the structure of the interrelationships that underlie it (to the extent the existing literature is accurate). In the past this was done by narrative review (careful reading and interpretation of results), but that approach fails to completely survey the existing knowledge base, to state clear rules for inclusion and exclusion, or to use a common statistical metric for combining findings.
Over the past 20-30 years, quantitative synthesis (aka meta-analysis) has become more common. It allows for the quantitative assessment of factors and offers a means of addressing problems that narrative analysis cannot. Meta-analysis assesses the construct validity of research findings that involve different methods of design and measurement. It converts the results of different studies into a common metric and then combines and compares the results across studies.
The first step of meta-analysis is to grasp the scientific results on the phenomenon being studied. Begin with a hypothesis linking an IV with a DV (or a relationship between two DVs if correlational) in a simple A-B relationship. Tentatively decide which studies will be included/excluded and gather every possible study that meets the criteria. Calculate and analyze the magnitude of the effect size indices drawn from the data. (The effect size index is the basic unit of all meta-analysis; calculating it can be difficult when studies fail to provide the necessary data.) The goal is to develop a statistical indicator of the strength of a given manipulation; once it is transformed to a common metric, we can make direct comparisons and aggregate data. Studies with more reliable results (bigger N) are usually weighted more heavily in the meta-analysis. We can then compute a confidence interval around the weighted mean; if the interval does not contain zero, it suggests a reliable relationship between the cause and effect variables. However, no interpretations should be made until the homogeneity of the effect size indices has been tested. When there is substantial heterogeneity, the weighted mean effect size is not an accurate summary, and one should search for moderator variables.
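These aggregation steps can be sketched for correlational effect sizes using the common Fisher-z, inverse-variance approach (one of several possible common metrics; the study r’s and sample sizes below are hypothetical):

```python
import math

# Hypothetical studies: (effect size r, sample size N)
studies = [(0.30, 50), (0.25, 120), (0.40, 80)]

# Transform each r to Fisher's z; weight by inverse variance (N - 3),
# so larger studies count more heavily.
zs = [(math.atanh(r), n - 3) for r, n in studies]
w_sum = sum(w for _, w in zs)
z_bar = sum(z * w for z, w in zs) / w_sum

# 95% confidence interval around the weighted mean, back-transformed to r.
se = 1 / math.sqrt(w_sum)
ci = (math.tanh(z_bar - 1.96 * se), math.tanh(z_bar + 1.96 * se))

# Homogeneity statistic: under homogeneity, Q ~ chi-square with k-1 df.
Q = sum(w * (z - z_bar) ** 2 for z, w in zs)

print(round(math.tanh(z_bar), 3), tuple(round(c, 3) for c in ci), round(Q, 2))
```

Here the interval excludes zero (a reliable overall relationship) and Q is small relative to the chi-square criterion, so interpreting the single weighted mean is defensible; a large Q would instead prompt the moderator search described above.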
After establishing moderators, each study should be coded based on the moderator. The effect of moderator variable can be tested in two ways: (1) Dividing the total set of studies into subsets that differ on the characteristic in question to determine whether the average effect sizes differ significantly from each other. (2) Enter coded variable into a correlation (or multiple regression) analysis with effect size as the dependent measure. This second option is appropriate when the variable is quantitative.
When interpreting results, if the mean effect size over all studies is small and not statistically significant, you need to determine whether the data set is homogeneous. Some meta-analyses may find a distribution of effect size indices that is homogeneous but still produces non-significant results; in that case the hypothesized effects appear too weak to matter. Cohen’s breakdown of effect sizes (described as a function of the amount of variance they explain): r = .10 (1% of variance) is small, r = .30 (9%) is medium, r = .50 (25%) is large. Small effect sizes can still be worth considering on a practical level. Rosenthal and Rubin proposed a method of interpreting effect sizes in terms of the differences in positive/negative outcomes between treatment groups: the binomial effect size display (BESD).
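The BESD itself is simple arithmetic: an effect size r is displayed as “success rates” of .5 + r/2 and .5 − r/2 in the two groups.

```python
def besd(r):
    """Binomial effect size display: map effect size r to the implied
    success rates in treatment and control groups."""
    return {"treatment": 0.5 + r / 2, "control": 0.5 - r / 2}

# Even a "small" effect (r = .10, 1% of variance explained) corresponds to
# shifting outcomes from 45% to 55% -- a 10-point difference.
print(besd(0.10))
```

This display is why small effect sizes can matter practically: the difference between the two rates equals r itself.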
Syntheses may also be qualitative. Qualitative syntheses are characterized by counting the number of studies in which statistically significant and non-significant results have been obtained. Direct comparison of probability values is unwise, as these values are not directly comparable across studies, and considerable information may be lost. This approach also ignores the magnitude of effects and provides no information on the direction, or trend, of statistically non-significant results. Meta-analysis results are less conservative than those based on tabulation methods; however, the two approaches are best used in combination. When there are significant variations across studies in the effects obtained, a careful examination of substantive and methodological differences among the studies is essential for drawing any meaningful conclusions.
Chapter 19: Social Responsibility and Ethics in Social Research
Informed consent is paramount. Participants should have full knowledge of what participation will involve and their choice to participate should be voluntary. When deviating from this guideline, the rights of the participant must be weighed against the potential significance of the research. The board may decide that consent be based on trust in the qualified investigator and the integrity of the research institution.
There may be deception in the lab to control participants’ perceptions; this is a controversial aspect of social research, particularly since it violates the “do no harm” directive. Most researchers try to minimize the amount of deception employed and justify it as a “white lie” that doesn’t harm the subject and serves the greater good. Kelman recommends: (1) reducing the unnecessary use of deception; (2) exploring ways to counteract or minimize its negative consequences; (3) developing new methods (e.g., role playing, simulation) that don’t require deception. Note that the alternatives in (3) have shown mixed results.
Debriefing participants informs them of the true nature and purpose of the experimental treatments. Kelman regards including a debrief as mandatory. Milgram used extensive debriefing sessions in which participants were reassured about the nature of their responses and encouraged to express their reactions in what was essentially a psychotherapeutic setting. Such rich debriefings might not only “undeceive” participants but also enrich their experience, promoting reflection and self-awareness. However, debriefing can bias subject populations: participants deceived in one study have been shown to behave in a manner biased toward favorable self-presentation in subsequent studies. Smith and Richardson showed that participants who had been deceived viewed their experience more positively than those who had not, and that effective debriefing works. However, this might be a by-product of the fact that deception-based studies tend to be more interesting than those that do not require deception.
Researchers must also consider issues of privacy and confidentiality when data are collected in field settings. The nature of the research is often disguised in field studies, so participants may not only be deceived about the nature of the research, but may even be unaware that they are the subject of research in the first place. Some scientists regard the practice of concealed observation or response elicitation as acceptable as long as it is limited to essentially “public” behaviors or settings normally open to public observation. Others regard any form of unaware participation in research as an intolerable invasion of the individual’s right to privacy.
Research data are subject to subpoena. If names are recorded for any reason, the researcher might have to reveal them. In some cases a “certificate of confidentiality” can be obtained from the Public Health Service, but most social research is not protected in this way.
Federally funded institutions are required to set up IRBs (Institutional Review Boards) to evaluate, approve, and monitor research. The IRB is made up of disinterested individuals, scientists and laypersons alike, whose primary goal is the protection of participants’ rights. Behavioral researchers are also subject to the codes of ethics of the National Academy of Sciences and the APA. These codes provide guidelines for the planning, design, and execution of research studies.