문서화 지침

문서화-1 (17). 검사가 다른 집단에서 사용될 수 있도록 번안될 때는, [원본 검사와의] 동등성을 지지하는 모든 증거에 대한 설명을 포함하여, 모든 변경사항을 기술 교본에 기록해서 제공한다⁸². 설명. 본 지침의 중요성은 많은 연구자에 의해 인식되고 강조되고 있다(예를 들어, Grisay,

81 측정 원점 동일성(Scalar invariance)에서 원점은 구성개념(F)을 측정치(X)에 의해 측정하는 과정을 나타내는 1차식(X=a+bF+e)에서 절편(a)을 의미하는데, 구성개념이 0일 때 부여되는 측정치의 값이므로 측정원점(origin)이라고도 한다. 위의 1차식에서 b는 측정단위이다. 통상적으로 측정단위 동일성이 성립해야 측정원점 동일성을 검증하게 된다. 만일에 집단간에 단위도 원점도 동일하면 측정치(실제는 검사점수)가 동일하다고 해석된다.

82 검사제작후에 두가지 종류의 교본이 작성된다. 하나는 사용자들을 위해 검사의 목적 및 유용성, 그리고 실제 사용의 실무적 사항을 중심으로 한 사용자 교본이고, 나머지는 검사의 품질을 검토하는 검사이론의 전문가들을 위해 기술적 사항을 총괄해서 기록한 기술 교본(통상 개발 보고서의 형식을 취함)이다. 따라서 번안 검사에도 사용자 교본에 더해서 기술교본이 제공된다. 사용자 교본은 문서화-2(18)에서 설명된다

2003 참조). TIMSS 와 PISA는 번안 작업 전반에 걸쳐 변경사항을 상세하게 문서로 만들어 이 지침을 성공적으로 준수해 왔다. 이러한 문서화는, 이후에 도입되는 변경사항이 적절한지의 판단을 위한 근거가 된다.

또한 기술 교본에는, 나중에 연구자들이 동일 집단 또는 다른 집단에 그 절차를 재현할 수 있도록 방법론에 대하여 충분히 상세한 정보가 수록되어야 한다. 기술교본에는 새로운 집단에서 번안된 검사의 사용을 지지하는 구성개념 동등성, 점수체계 동등성([연계]가 실행된 경우)의 증거에 대한 충분한 정보가 수록되어야 한다. 집단 간 비교가 이루어질 경우, 기술 교본에는 집단 간 점수의 동등화를 결정하는데 사용된 증거를 보고 해야 한다.

때로는, 기술 교본에 관심있어 하는 독자로부터 질문이 제기될 수 있다. 이 교본은 이 분야 전문가는 물론, [검사가] 새로운 다른 집단에 사용되기 전에 그 검사의 유용성을 평가하는 사람들을 위해 작성되어야 한다 (비전문가의 이해를 위해 간단한 보충 문서가 추가될 수 있다).

실무를 위한 제안. 번안 검사에는 번안 과정과 관련된 모든 양적, 질적 자료들을 문서화한 기술 교본이 있어야 한다. 특히 번안 과정에서 제2의 언어 및 문화권에 맞게 변경된 사항을 모두 문서화하는 것이 도움이 된다. 기본적으로 이 분야 전문가와 학술지 편집자는 검사의 번안판을 개발하고 타당화하는데 투입된 모든 과정에 대하여 기술된 문서를 원할 것이다.

물론, 그들은 모든 분석 결과를 보고 싶어 할 것이다. 다음은 그들에게 답변해야 할 질문의 유형이다.

 [검사에서 측정되는] 구성개념 및 새로운 집단에 맞춰 번안된 검사의 효용성을 지지하는 증거는 무엇인가⁸³?

 어떠한 표본에서 어떠한 문항들에 대한 자료가 수집되었는가⁸⁴?

 내용타당도, 준거타당도, 그리고 구성개념 타당도를 평가하기 위해 어떤 다른 자료를 구했는가⁸⁵?

 다양한 자료들이 어떻게 분석되었는가?

83 번안검사의 평가자나 소비자라면 우선 해당 구성개념의 조작적 정의 또는 그것을 측정할 필요에 대한 증거(구성개념의 효용성)를 찾고자 할 것이다. 그리고 나면 그것을 측정하는 번안검사의 타당도 증거(번안검사의 효용성)를 보고자 할 것이다.

84 문항관련 자료이므로 예비검사 단계에서의 내용이 중심이 된다.

85 타당화단계이므로 본검사 및 필요에 따른 추가검사 단계를 의미한다.

 그 결과물들은 무엇인가?

문서화-2 (18). 새로운 집단에서 번안 검사를 사용하는 실무자를 지원해줄 사용자 교본을 제공한다.

설명. 이 문서는 실제 평가장면에서 번안 검사를 사용할 사람을 위해 작성되어야 한다.

검사 이용에 관한 국제검사위원회(ITC) 지침서에서 정한 모범적 실무와 일치해야 한다(www.InTestCom.org 참조).

실무를 위한 제안. 검사 개발자는 집단의 사회 문화 및 생태학적 환경이 검사 수행에 영향을 미칠 수 있는 가능성에 대하여 구체적인 정보를 제공해야 한다. 사용자 교본은 다음과 같이 구체적으로 구성되어야 한다.

 검사에 의해 측정되는 구성개념을 설명하고 정보를 요약한다. 즉 번안 과정을 기술한다.

 문항 내용의 문화적 적절성, 검사 지시사항, 반응 형식 등 번안을 뒷받침하는 모든 증거를 정리한다.

 대상집단 내의 다양한 하위집단들에 걸친 검사 사용의 적절성 및 기타 사용상의 제약 조건에 대하여 정의한다.

 검사 실행의 모범적 실무에 관련하여 검토해야 할 사항들에 관해 설명한다.

 집단 간 비교가 가능한지, 어떻게 비교할 수 있는지를 설명해야 한다.

 채점과 규준(예, 관련 규준표)⁸⁶을 위해 필요한 정보를 제공하거나 사용자가 어떻게 채점 절차(예, 컴퓨터 기반 검사)에 접근할 수 있는지 설명한다.

 검사 점수로부터의 추론에 관한 타당도 및 신뢰도 자료가 가지는 의미에 대한 정보를 포함하여 결과 해석에 관한 지침들을 제공한다.

맺음말

우리는 [번안] 검사 개발자와 사용자들의 업무 수행을 돕기 위한 일련의 지침을 전달하기 위해

86 검사의 목적이 규준참조검사(상대평가)일 때는 규준이 있을 것이고 준거(영역)참조 검사(절대평가)라면 기준점수가 있을 것이다.

최선을 다했다. 그러나 이러한 지침들 그리고 부실한 관행을 바꾸려는 노력이 효과를 거두기 위해서는 이를 전파하는 좋은 장치가 있어야 한다. Rios와 Sireci(2014)가 최근 체계적으로 검토한 결과, 문헌에 보고된 대부분의 검사 번안이, 이미 20년의 역사를 가진 ITC 지침을 따르지 않았다.

따라서 우리는 독자들이 동료들에게, 전 세계의 많은 전문가들이 기여한 최선의 실무 참고서로서의 지침서 2판에 대한 인지도를 높이기 위해 모든 노력을 기울일 것을 권한다.

동시에, 우리는 이 지침의 첫 번째 판이 현재 바뀌는 것처럼, 이어서 이번 두 번째 판의 지침들도 언젠가는 바뀔 것이다. 잘 알려진 AERA, APA, NCME 검사 표준서는 현재 6판(AERA, APA, &

NCME, 2014)까지 나와 있다. 우리는 번안검사에 대한 ITC의 지침도 앞으로 또 다른 개정을 겪을 것으로 기대한다. 인용될 필요가 있거나 제3판에 영향을 미치는 새로운 연구를 알고 있는 경우, 또는 여기에 제시된 18개 지침에 대한 새로운 지침이나 수정을 제공하고자 하는 경우 ITC에 알려주기 바란다 (www.InTestCom.org). 제2판을 제작한 현 “연구/지침 위원회” 위원장 또는 www.InTestCom.org 에 있는 이메일 주소로 ITC 간사에게 연락하면 된다.

감사의 인사말

국제검사위원회(ITC)는 몇 년 동안 번안검사 지침서 제2판을 만들기 위해 일한 6명의 위원들에게 감사를 표하고자 한다. Dave Bartram, SHL, UK; Giray Berberoglu, Middle East Technical University, Turkey; Jacques Grégoire, Catholic University of Louvain, Belgium; Ronald Hambleton, Committee Chairperson, University of Massachusetts Amherst, USA; Jose Muñiz, University of Oviedo, Spain;

and Fons van de Vijver, University of Tilburg, Netherlands.

또한 국제검사위원회는 Chad Buckendahl (미국), 영국에 있는 Anne Herrmann 과 그녀의 동료들 OPP Ltd. (영국), 그리고 매사추세츠 대학교(미국)에 있는 April Zenisky, 및 켄트 대학교(영국)에 있는 Anna Brown 에게 이 문서의 초안을 주의 깊게 검토해준 것에 대해 감사를 표한다. ITC는, 직간접적으로 ITC의 번역/번안 검사 지침서 제2판에 공헌한 전 세계의 다른 모든 검토자들에게도 감사드린다.

36 참고문헌

Allalouf, A., Hambleton, R. K., & Sireci, S. G. (1999). Identifying the causes of DIF in translated verbal items. Journal of Educational Measurement, 36(3), 185-198.

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

Angoff, W. H. (1984). Scales, norms, and equivalent scores. Princeton, NJ: Educational Testing Service.

Angoff, W. H., & Modu, C. C. (1973). Equating the scales of the Prueba de Apititud Academica and the Scholastic Aptitude Test (Research Rep No. 3). New York: College Entrance Examination Board.

Asparouhov, T. & Muthén, B. (2009). Exploratory structural modeling. Structural Equation Modeling, 16, 397-438.

Brislin, R. W. (1986). The wording and translation of research instruments. In W. J. Lonner & J. W.

Berry (Eds.), Field methods in cross-cultural psychology (pp. 137-164). Newbury Park, CA: Sage Publications.

Byrne, B. (2001). Structural equation modeling with AMOS, EQS, and LISREL: Comparative approaches to testing for the factorial validity of a measuring instrument. International Journal of Testing, 1, 55-86.

Byrne, B. (2003). Measuring self-concept measurement across culture: Issues, caveats, and application. In H. W. Marsh, R. Craven, & D. M. McInerney (Eds.), International advances in self research(pp. 30-41). Greenwich, CT: Information Age Publishing.

Byrne, B. (2006). Structural equation modeling with EQS: Basic concepts, applications, and programming (2nd ed.). Mahwah, NJ: Lawrence Erlbaum Publishers.

Byrne, B. M. (2008). Testing for multigroup equivalence of a measuring instrument: A walk through the process. Psichothema, 20, 872-882.

Byrne, B. M., & van de Vijver, F.J.R. (2010). Testing for measurement and structural equivalence in large-scale cross-cultural studies: Addressing the issue of nonequivalence. International Journal of Testing, 10, 107-132.

Byrne, B. M., & van de Vijver, F.J.R. (2014). Factorial structure of the Family Values Scale from a multilevel-multicultural perspective. International Journal of Testing, 14, 168-192.

Clauser, B. E., Nungester, R. J., Mazor, K., & Ripley, D. (1996). A comparison of alternative matching strategies for DIF detection in tests that are multidimensional. Journal of Educational

37 Measurement, 33(2), 202-214.

Cook, L. L., & Schmitt-Cascallar, A. P. (2005). Establishing score comparability for tests given in different languages. In R. K. Hambleton, P. F. Merenda, & C. Spielberger (Eds.), Adapting educational and psychological tests for cross-cultural assessment (pp. 139-170).

Dorans, N. J., & Holland, P. W. (1993). DIF detection and description: Mantel-Haenszel and standardization. In P. W. Holland & H. Wainer (Eds.), Differential item functioning: Theory and Practice (pp. 137-166).

Ellis, B. B. (1989). Differential item functioning: Implications for test translation. Journal of Applied Psychology, 74, 912-921.

Ellis, B. B., & Kimmel, H. D. (1992). Identification of unique cultural response patterns by means of item response theory. Journal of Applied Psychology, 77, 177-184.

Ercikan, K. (1998). Translation effects in international assessments. International Journal of Educational Research, 29(6), 543-533.

Ercikan, K. (2002). Disentangling sources of differential item functioning in multilanguage assessments. International Journal of Testing, 2(3), 199-215.

Ercikan, K., Gierl, J. J., McCreith, T., Puhan, G., & Koh, K. (2004). Comparability of bilingual versions of assessments: Sources of incomparability of English and French versions of Canada's national achievement tests. Applied Measurement in Education, 17(3), 301-321.

Ercikan, K., Simon, M., & Oliveri, M. E. (2013). Score comparability of multiple language versions of assessments within jurisdictions. In M. Simon, K. Ercikan, & M. Rousseau (Eds.), An international handbook for large-scale assessments (pp. 110-124). New York:

Grégoire, J., & Hambleton, R. K. (Eds.). (2009). Advances in test adaptation research [Special Issue].

International Journal of Testing, 9(2), 73-166.

Grisay, A. (2003). Translation procedures in OECD/PISA 2000 international assessment. Language Testing, 20(2), 225-240.

Hambleton, R. K. (2002). The next generation of the ITC test translation and adaptation guidelines.

European Journal of Psychological Assessment, 17(3), 164-172.

Hambleton, R. K. (2005). Issues, designs, and technical guidelines for adapting tests into multiple languages and cultures. In R. K. Hambleton, P. F. Merenda, & C. Spielberger (Eds.), Adapting educational and psychological tests for cross-cultural assessment (pp. 3-38). Mahwah, NJ:

Lawrence Erlbaum Publishers.

Hambleton, R. K., & de Jong, J. (Eds.). (2003). Advances in translating and adapting educational and psychological tests. Language Testing, 20(2), 127-240.

Hambleton, R. K., & Patsula, L. (1999). Increasing the validity of adapted tests: Myths to be avoided and guidelines for improving test adaptation practices. Applied Testing Technology, 1(1), 1-16.

Hambleton, R. K., Clauser, B. E., Mazor, K. M., & Jones, R. W. (1993). Advances in the detection of differentially functioning test items. European Journal of Psychological Assessment, 9(1), 1-18.

Hambleton, R. K., & Lee, M. (2013). Methods of translating and adapting tests to increase crosslanguage validity. In D. Saklofske, C. Reynolds, & V. Schwean (Eds.), The Oxford handbook of child assessment (pp. 172-181). New York: Oxford University Press.

Hambleton, R. K., Merenda, P. F., & Spielberger, C. (Eds.). (2005). Adapting educational and psychological tests for cross-cultural assessment. Mahwah, NJ: Lawrence Erlbaum Publishers.

Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory.

Newbury Park, CA: Sage Publications.

Hambleton, R. K., Yu, L., & Slater, S. C. (1999). Field-test of ITC guidelines for adapting psychological tests. European Journal of Psychological Assessment, 15 (3), 270-276.

Hambleton, R. K., & Zenisky, A. (2010). Translating and adapting tests for cross-cultural assessment.

In D. Matsumoto & F. van de Vijver (Eds.), Cross-cultural research methods (pp. 46-74). New York, NY; Cambridge University Press.

Holland, P. W., & Wainer, H. (Eds.). (1993). Differential item functioning. Hillsdale, N. J.: Lawrence Erlbaum Associates.

Javaras, K. N., & Ripley, B. D. (2007). An 'unfolding' latent variable model for Likert attitude data:

Drawing inferences adjusted for response style. Journal of the American Statistical Association, 102, 454-463.

Jeanrie, C., & Bertrand, R. (1999). Translating tests with the International Test Commission Guidelines: Keeping validity in mind. European Journal of Psychological Assessment, 15(3), 277-283.

Johnson, T. R. (2003). On the use of heterogeneous thresholds ordinal regression models to account for individual differences in response style. Psychometrika, 68, 563-583.

Kolen, M. J., & Brennan, R. (2004). Test equating, scaling, and linking: Methods and practices (2nd ed.). New York: Springer.

문서에서 검사 (페이지 32-44)