Developing and Analyzing a Spanish Corpus for Forensic Purposes
DOI:
https://doi.org/10.5195/lesli.2019.19
Keywords:
forensic linguistics, linguistic corpus, morphosyntactic analysis, semantics
Abstract
This paper presents the methods for developing a database of Spanish writing that can be used for forensic linguistic research, including our data collection procedures. The main data collection instrument has been translated into Spanish and adapted from Chaski (2001). It consists of ten tasks in which the subjects are asked to write formal and informal texts on different topics. To date, 93 undergraduates from Spanish universities have participated in the study, as have prisoners convicted of gender-based abuse. The data collected have been analyzed from two perspectives, semantic and morphosyntactic. The semantic analysis uses psycholinguistic categories, many of them taken from the LIWC dictionary (Pennebaker et al., 2001). To obtain a more comprehensive depiction of the linguistic data, further ad hoc categories have been created from the corpus itself and validated with a double-check method to ensure inter-rater reliability. For the morphosyntactic analysis, the natural language processing tool ALIAS TATTLER is being developed for Spanish. Results show that it is possible to differentiate non-abusers from abusers with high accuracy based on linguistic features.
References
Almela, A., Alcaraz-Mármol, G. and Cantos, P. (2015). Analysing deception in a psychopath's speech: A quantitative approach. DELTA: Documentação de Estudos em Lingüística Teórica e Aplicada, 31(2), 559-572.
Almela, A., Valencia-García, R. and Cantos, P. (2013). Seeing through Deception: A Computational Approach to Deceit Detection in Spanish Written Communication. Linguistic Evidence in Security, Law and Intelligence, 1(1), 3-12.
Baker, P. (2012). Acceptable bias? Using corpus linguistics methods with critical discourse analysis. Critical Discourse Studies, 9(3), 247-256.
Cantos Gomez, P. (2013). Statistical Methods in Language and Linguistic Research. Sheffield, UK: Equinox Publishing Ltd.
Chaski, C.E. (2001). Empirical Evaluations of Language-based Author Identification Techniques. International Journal of Speech, Language and Law (previously Forensic Linguistics), 8(1), 1-66.
Chaski, C.E. (2005). Who's at the keyboard? Authorship Attribution in Digital Evidence Investigations. International Journal of Digital Evidence, Spring 2005.
Chaski, C.E. (2007). The Keyboard Dilemma and Author Identification. In S. Shenoi and P. Craiger (Eds.), Advances in Digital Forensics III. New York: Springer.
Chaski, C.E. (2012). Best Practices and Admissibility of Forensic Author Identification. Journal of Law and Policy, 21(2). Brooklyn Law School.
Coulthard, M. (1994). On the use of corpora in the analysis of forensic texts. International Journal of Speech, Language and Law (previously Forensic Linguistics), 1(1), 27-43.
Eagleson, R. (1994). Forensic analysis of personal written texts: a case study. In J. Gibbons (Ed.), Language and the Law. London: Longman.
Fornaciari, T. and Poesio, M. (2012). Sincere and deceptive statements in Italian criminal proceedings. In Proceedings of the International Association of Forensic Linguists Tenth Biennial Conference (pp. 126–138).
Guillén, V., Vargas, C., Pardiño, M., Martínez, P. and Suárez, A. (2008). Exploring State-of-the-art Software for Forensic Authorship Identification. International Journal of English Studies, 8(1), 1-28.
Hancock, J.T., Woodworth, M.T. and Porter, S. (2011). Hungry like the wolf: A word-pattern analysis of the language of psychopaths. Legal and Criminological Psychology, 18(1), 1-13.
Johnson, S.A. (2006). Physical Abusers and Sexual Offenders: Forensic and Clinical Strategies. New York: Taylor and Francis.
Juola, P. (2006). Authorship attribution. Foundations and Trends in Information Retrieval, 1(3), 233-334.
Kennedy, G. (1998). An Introduction to Corpus Linguistics. London/New York: Longman.
Kniffka, H. (2000). Anonymous Authorship Analysis without Comparison Data? A Case Study with methodological impact. Linguistische Berichte, 182, 179-198.
Koppel, M., Schler, J. and Argamon, S. (2009). Computational methods in authorship attribution. Journal of the American Society for Information Science and Technology, 60(1), 9-26.
Leech, G. (2005). Adding Linguistic Annotation. In M. Wynne (Ed.), Developing Linguistic Corpora: a Guide to Good Practice. Oxford: Oxbow Books.
Leech, G. (1992). Corpora and theories of linguistic performance. In J. Svartvik (Ed.), Directions in corpus linguistics (pp. 105-122). Berlin: Mouton De Gruyter.
McEnery, T. (2003). Corpus Linguistics. In R. Mitkov (Ed.), The Oxford Handbook of Computational Linguistics. Oxford: Oxford University Press.
McEnery, T. and Hardie, A. (2012). Corpus Linguistics: Method, Theory and Practice. Cambridge Textbooks in Linguistics. Cambridge: Cambridge University Press.
Newman, M. L., Pennebaker, J. W., Berry, D. S. and Richards, J. M. (2003). Lying words: Predicting deception from linguistic styles. Personality and Social Psychology Bulletin, 29, 665-675.
Parodi, G. (2008). Lingüística de corpus: Una introducción al ámbito. Revista de Lingüística Teórica y Aplicada, 46(1), 93-119.
Pennebaker, J. W., Francis, M. E. and Booth, R. J. (2001). Linguistic Inquiry and Word Count. Mahwah (NJ): Erlbaum Publishers.
Renouf, A. (1987). Corpus Development. In J. M. Sinclair (Ed.), Looking Up. Glasgow/London: Harper Collins Publishers.
Saldanha, G. (2009). Principles of corpus linguistics and their application to translation studies research. Tradumàtica, 7, 1-7.
Shapero, J. J. (2011). The Language of Suicide Notes. Unpublished Thesis. University of Birmingham.
Stone, P.J., Bales, R.F., Namenwirth, J.Z., and Ogilvie, D.M. (1962). The general inquirer: A computer system for content analysis and retrieval based on the sentence as a unit of information. Journal of the Society for General Systems Research, October 1962.
Stone, P.J., Dunphy, D., Smith, M.S., and Ogilvie, D.M. (1966). The General Inquirer: a computer approach to content analysis. Cambridge, MA: MIT Press.
Teubert, W. (2005). My version of corpus linguistics. International Journal of Corpus Linguistics, 10(1), 1-13.
Tognini-Bonelli, E. (2001). Corpus Linguistics at Work. Amsterdam and Philadelphia: John Benjamins.
Zipf, G.K. (1949). Human Behavior and the Principle of Least Effort. Cambridge, Massachusetts: Addison-Wesley.
License
- The Author shall grant to the Publisher and its agents the nonexclusive perpetual right and license to publish, archive, and make accessible the Work in whole or in part in all forms of media now or hereafter known under a Creative Commons Attribution 4.0 License or its equivalent, which, for the avoidance of doubt, allows others to copy, distribute, and transmit the Work under the following conditions:
  - Attribution: other users must attribute the Work in the manner specified by the author as indicated on the journal Web site;
- The Author is able to enter into separate, additional contractual arrangements for the nonexclusive distribution of the journal's published version of the Work (e.g., post it to an institutional repository or publish it in a book), as long as there is provided in the document an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post online a pre-publication manuscript (but not the Publisher’s final formatted PDF version of the Work) in institutional repositories or on their Websites prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (see The Effect of Open Access). Any such posting made before acceptance and publication of the Work shall be updated upon publication to include a reference to the Publisher-assigned DOI (Digital Object Identifier) and a link to the online abstract for the final published Work in the Journal.
- Upon Publisher’s request, the Author agrees to furnish promptly to Publisher, at the Author’s own expense, written evidence of the permissions, licenses, and consents for use of third-party material included within the Work, except as determined by Publisher to be covered by the principles of Fair Use.
- The Author represents and warrants that:
  - the Work is the Author’s original work;
  - the Author has not transferred, and will not transfer, exclusive rights in the Work to any third party;
  - the Work is not pending review or under consideration by another publisher;
  - the Work has not previously been published;
  - the Work contains no misrepresentation or infringement of the Work or property of other authors or third parties; and
  - the Work contains no libel, invasion of privacy, or other unlawful matter.
- The Author agrees to indemnify and hold Publisher harmless from Author’s breach of the representations and warranties contained in Paragraph 7 above, as well as any claim or proceeding relating to Publisher’s use and publication of any content contained in the Work, including third-party content.