Corpus Linguistics - Normalised Frequency

Gede Primahadi Wijaya Rajeg

25 Apr 2020 · 09:18

Summary

TLDR: This video explains normalized frequency in corpus linguistics, building on earlier discussions of frequency analysis in the COCA corpus. Absolute (raw) frequency can be misleading when the texts being compared differ in size. Normalization corrects for this by dividing the raw frequency by the size of the text and multiplying by a base number, such as 1 million, to give a frequency per 1 million words. The resulting figures can be compared directly across texts of different lengths, which is essential for meaningful linguistic analysis, particularly in large corpora like COCA.

Takeaways

😀 Raw frequency measures the number of occurrences of a linguistic phenomenon, but it doesn't account for differences in text or genre size.
😀 Normalized frequency adjusts raw frequency based on the size of the corpus section, allowing for meaningful comparisons between different texts or genres.
😀 In corpus linguistics, comparing raw frequency across sections of different sizes can lead to misleading conclusions, making normalization essential.
😀 The process of normalization involves dividing the absolute frequency by the size of the section and multiplying by a base number, typically 1,000 or 1 million, to give a frequency per 1,000 or per 1 million words.
😀 For example, finding the same raw count of modal verbs in two texts of different lengths might suggest equal frequency, even though the shorter text uses modal verbs at a higher rate.
😀 COCA (Corpus of Contemporary American English) uses normalized frequency per 1 million words to make comparisons across its different sections and genres.
😀 The absolute frequency of a linguistic phenomenon can be misleading unless adjusted for the size of the corpus sections being compared.
😀 Normalized frequency provides a more accurate reflection of a phenomenon's relative frequency across varying corpus sections or text genres.
😀 In corpus studies, normalization puts frequency counts on a common scale, so that texts or genres of different sizes can be compared fairly.
😀 COCA provides both absolute frequency and normalized frequency values, which allow users to compare data across sections of different sizes more fairly.
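
The calculation described in the takeaways can be sketched in a few lines of Python. This is only an illustration, not the method used by COCA itself: the function name `normalized_frequency`, the section names, and all counts and sizes below are hypothetical figures chosen to show how raw and normalized frequency can point in opposite directions.

```python
def normalized_frequency(raw_count: int, section_size: int, base: int = 1_000_000) -> float:
    """Return the number of occurrences per `base` words."""
    if section_size <= 0:
        raise ValueError("section_size must be positive")
    # Multiply before dividing so that whole-number results stay exact.
    return raw_count * base / section_size


# Hypothetical figures (not taken from COCA): two sections of different sizes.
sections = {
    "spoken": {"raw_count": 120, "size": 2_000_000},
    "academic": {"raw_count": 90, "size": 1_000_000},
}

# Normalize each section's raw count to a frequency per 1 million words.
per_million = {
    name: normalized_frequency(s["raw_count"], s["size"])
    for name, s in sections.items()
}

for name, value in per_million.items():
    raw = sections[name]["raw_count"]
    print(f"{name}: {raw} raw occurrences, {value:.1f} per million words")
```

With these made-up numbers, the larger section has the higher raw count but the lower frequency per million words, which is the kind of reversal that raw counts alone would hide.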

Q & A

What is normalized frequency in corpus linguistics?
-Normalized frequency is a way to adjust raw frequency counts to account for variations in the size of text sections or corpora, allowing for a more accurate comparison of linguistic phenomena across different texts or genres.
Why is it necessary to normalize frequency when comparing different sections of a corpus?
-Normalization is necessary because different sections of a corpus may vary in size, and raw frequency counts are influenced by these size differences. Normalizing the frequency allows for meaningful comparisons between sections of different lengths.
How is normalized frequency calculated?
-Normalized frequency is calculated by dividing the absolute frequency of a linguistic phenomenon by the size of the corpus section and then multiplying the result by a base number (such as 1,000 or 1,000,000) to standardize the frequency for comparison.
What is the significance of per mille (per thousand words) figures in corpus linguistics?
-Per mille means per thousand, so a per mille figure gives the frequency per 1,000 words. Here 1,000 serves as the base number for normalization; larger corpora often use a base of 1 million words instead. Either base standardizes frequency counts, making it possible to compare linguistic phenomena across corpora of varying sizes.
What was the example used to illustrate the concept of normalized frequency?
-The example involved comparing the frequency of modal verbs in two texts of different lengths. Text 1 had 750 words, and Text 2 had 1,200 words. After normalizing the frequency, the comparison revealed that the relative frequency of modal verbs was not the same in both texts, despite their raw frequency being equal.
How does normalization affect the conclusion about the frequency of modal verbs in the example?
-Normalization revealed that the initial conclusion, which suggested an equal frequency of modal verbs, was misleading. The normalized frequency showed that there were more modal verbs per 1,000 words in Text 1 than in Text 2.
What is the purpose of dividing the absolute frequency by the size of the genre section in corpus linguistics?
-Dividing the absolute frequency by the size of the genre section ensures that the frequency count is adjusted to account for differences in text length, allowing for accurate and comparable results across genres of varying sizes.
Why are different base numbers used for normalization in corpus linguistics?
-Different base numbers are chosen for normalization based on the typical size of the sections being compared. Any shared base makes the counts comparable, but a base close to the size of the sections (1,000 for short texts, 1 million for large corpus sections) keeps the normalized figures from extrapolating far beyond the data.
What is the role of genre sections in the corpus when calculating normalized frequency?
-Genre sections in the corpus represent different categories of text, each with a different size. When calculating normalized frequency, the size of these sections is taken into account to ensure that comparisons between genres are accurate and reflect relative frequency rather than raw frequency.
How do you determine which base number to use for normalization?
-The base number for normalization is typically chosen to be close to the size of the corpus sections being compared. For example, a base number of 1,000 or 1 million words is often used, depending on the overall size of the sections.
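
The modal verb example and the choice of base number can be worked through as a rough sketch. The text sizes (750 and 1,200 words) come from the example above, but the summary does not state the raw count of modal verbs, so `MODAL_COUNT = 15` is an assumed value used purely for illustration, and `per_base` is a hypothetical helper name.

```python
def per_base(raw_count: int, text_size: int, base: int) -> float:
    """Return the frequency of `raw_count` occurrences per `base` words."""
    return raw_count * base / text_size


# Assumption: the summary gives no raw count, so 15 is a made-up value
# shared by both texts (the example states only that the counts are equal).
MODAL_COUNT = 15
text_sizes = {"Text 1": 750, "Text 2": 1_200}

# Same raw count, normalized with two different base numbers.
per_thousand = {name: per_base(MODAL_COUNT, size, 1_000) for name, size in text_sizes.items()}
per_million = {name: per_base(MODAL_COUNT, size, 1_000_000) for name, size in text_sizes.items()}

# The ratio between the two texts does not depend on the base.
ratio_thousand = per_thousand["Text 1"] / per_thousand["Text 2"]
ratio_million = per_million["Text 1"] / per_million["Text 2"]

print(per_thousand)   # {'Text 1': 20.0, 'Text 2': 12.5}
print(per_million)    # {'Text 1': 20000.0, 'Text 2': 12500.0}
print(ratio_thousand == ratio_million)  # True
```

Under this assumed count, Text 1 has 20 modal verbs per 1,000 words against 12.5 in Text 2. The per-million figures preserve the same ratio, but they describe a 750-word text as if it contained 20,000 modal verbs, which is why a base close to the actual text size is the more natural choice here.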

