Thesis Data Analysis: The Psychology of Scientific Fraud
Author
Benjamin Zubaly
Published
March 22, 2024
Introduction
This document is intended to track the data analysis for my undergraduate thesis project on the psychology of scientific fraud. I have already planned out (and preregistered) my data analysis plan (see here), and I will note throughout the document if I deviate from this plan and why. This project is also being tracked in a private GitHub repository in case I need to revert to a previous version of the project due to a fatal error. For reproducibility, the session info is displayed at the end of the document.
Data Cleaning: The data cleaning section will take the finalized dataset that I have included in the project directory and use it to create the remaining variables that we need for our analysis. We will also save this dataset.
Data Exploration: In the data exploration section, we will check for and deal with missing data, run descriptive statistics of our outcome variables, visualize our data, and run bivariate correlations.
Testing Hypotheses: In the hypothesis testing section, we will test each of our hypotheses one-by-one, according to our analysis plan.
Directory Set-Up
In order to ensure that this analysis is computationally reproducible, I have included everything that is needed to complete this analysis (just the finalized data file, “study_dataset.csv”) in the current working directory. The code will be displayed before the output of each analysis.
Variable Definitions
Although the following variables may not exist until after the data cleaning section, here is what each variable name refers to, so that you can refer to them while examining the code.
DOI: Unique identifier for each paper (i.e., the paper DOI).
PaperType: Categorical variable indicating SAFP, SAGP, MAFP, or MAGP.
LingObf: Continuous variable for linguistic obfuscation.
CertSent: Continuous variable for certainty sentiment.
Refs: Count variable for references.
FraudCorrAuth: Dichotomous variable indicating if the fraudulent author is the corresponding author (1) or is not (0). Unknown cases will be marked in a separate variable with this variable left blank.
NumAuth: Count variable indicating the number of authors for each paper.
abstraction: Abstraction index derived from the standardized scores for article, prep, and quantity (see Data Cleaning for the exact calculation).
article: Articles from LIWC.
prep: Prepositions from LIWC.
quantity: Quantities from LIWC.
cause: Causation terms from LIWC.
jargon: The percent of words not captured by LIWC (100-Dic).
Dic: The percentage of words captured by all LIWC dictionaries.
emo_pos: Positive emotion terms from LIWC.
flesch_re: Flesch Reading Ease from ARTE.
Initial Package Installations
If you would like to reproduce this analysis, here are the packages I will be using, so that they can be installed before starting.
install.packages("readr") # For reading data
install.packages("dplyr") # For data manipulation and handling of missing data
install.packages("psych") # For descriptive statistics, correlations
install.packages("ggplot2") # For data visualization
install.packages("car") # For diagnostic tests such as Levene's test
install.packages("effsize") # For calculating effect sizes
install.packages("rcompanion") # For Games-Howell post-hoc test (if applicable)
install.packages("dunn.test") # For Dunn post-hoc test (if applicable)
install.packages("boot") # For bootstrapped hypothesis tests (if applicable)
install.packages("sessioninfo") # For session info at end of document
Installing package into '/Users/benjaminzubaly/Library/R/x86_64/4.3/library'
(as 'lib' is unspecified)
The downloaded binary packages are in
/var/folders/38/1ybnplc53zdb089bn6drqfn00000gn/T//RtmpQYMDWi/downloaded_packages
Initial Import of Data
We will now load in the dataset “study_dataset.csv” from the working directory.
library(readr) # Loading the readr package
data <- read_csv("study_dataset.csv") # Loading in study dataset as "data"
Rows: 88 Columns: 207
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (26): DOI, Title, Subject, Institution, Journal, Publisher, Country, Au...
dbl (181): Record ID, RetractionPubMedID, OriginalPaperDate, year, OriginalP...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
I have viewed the data frame, and the data seems to have loaded correctly.
Data Cleaning
To conduct the analysis, we will first need to calculate the LingObf variable by calculating the abstraction index and jargon words; creating standardized scores for abstraction, cause, jargon, emo_pos, and flesch_re; and calculating the LingObf composite variable from these standardized scores.
First, we will calculate the abstraction index by creating standardized scores for article, prep, and quantity and summing them.
# Calculate standardized scores for article, prep, and quantity and add them to the dataset
data$articles_standardized <- scale(data$article)
data$prep_standardized <- scale(data$prep)
data$quantity_standardized <- scale(data$quantity)
# Create the new variable 'abstraction' as the inverse of the sum of the three standardized variables
data$abstraction <- (1 / (data$articles_standardized + data$prep_standardized + data$quantity_standardized))
After viewing the data, the transformations and variable calculation seem to have occurred appropriately.
Next, we will calculate the jargon words by subtracting Dic from 100.
# Calculate the new variable 'jargon' by subtracting 'Dic' from 100
data$jargon <- (100 - data$Dic)
After viewing the data, the variable calculation seems to have occurred appropriately.
Next, we will create standardized scores for each subcomponent of the LingObf.
# Standardize the new set of variables and add them to the dataset
data$abstraction_standardized <- scale(data$abstraction)
data$cause_standardized <- scale(data$cause)
data$jargon_standardized <- scale(data$jargon)
data$emo_pos_standardized <- scale(data$emo_pos)
data$flesch_re_standardized <- scale(data$flesch_re)
After viewing the data, the variable transformations seem to have occurred appropriately.
Now we will calculate the LingObf variable using the following formula: [cause_standardized + abstraction_standardized + jargon_standardized] – [emo_pos_standardized + flesch_re_standardized].
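The chunk that computes the composite is not displayed above, but given the formula it was presumably a single line along these lines (a sketch with toy standardized scores; only the formula itself comes from the analysis plan):

```r
# Toy standardized scores (hypothetical values, for illustration only)
data <- data.frame(
  cause_standardized       = c(0.5, -0.5),
  abstraction_standardized = c(1.0, -1.0),
  jargon_standardized      = c(0.2, -0.2),
  emo_pos_standardized     = c(0.1, -0.1),
  flesch_re_standardized   = c(0.3, -0.3)
)

# LingObf = [cause + abstraction + jargon] - [emo_pos + flesch_re], all standardized
data$LingObf <- (data$cause_standardized + data$abstraction_standardized + data$jargon_standardized) -
  (data$emo_pos_standardized + data$flesch_re_standardized)

data$LingObf  # 1.3 and -1.3 for the two toy rows
```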
After viewing the data, the variable calculation seems to have occurred appropriately.
Lastly, our variable that indicates certainty sentiment is currently certainty_avg, but to make things easier I am going to copy this data into a new variable called CertSent.
data$CertSent <- data$certainty_avg
To ensure that our clean data is saved, we will write the dataset to the current working directory.
# Writing our data as a csv file in the current working directory
write.csv(data, "clean_study_data.csv")
I have opened the saved data file outside of RStudio, and it seems to have been written correctly.
Data Exploration
Missing Data
The data will first be inspected for missing scores.
library(dplyr) # Loading the dplyr package for data manipulation and handling missing values
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
# To summarize the number of missing values in each column
missing_data_summary <- sapply(data, function(x) sum(is.na(x)))
print(missing_data_summary) # To see summary of missing values for all columns
Taking a look at our outcome variables, there are no missing scores, so we do not need to impute any values.
Descriptive Statistics
I will now generate descriptive statistics for each relevant variable in the dataset. I originally (in the preregistered plan) was simply going to deploy the describe() function on the entire dataset, but because I retained all of the columns from the Retraction Watch Database (Retraction Watch Database, 2023) and all of the output from the text analysis packages (Aggarwal, 2022; Boyd et al., 2022; Rocklage et al., 2023) there are currently 219 variables. Because this would be unmanageable, I am going to only calculate descriptive statistics for a selection of variables of interest. I will first create a dataframe with only the continuous variables that I am interested in generating descriptive statistics for, and I will use the psych package to produce the descriptive statistics for these variables.
library(psych) # Loading the psych package
# Selecting the continuous variables I am interested in getting descriptive statistics for (plus paper type for the next part)
continuous_data_for_descriptives <- data[c("year", "Refs", "flesch_re", "WC", "abstraction", "jargon", "CertSent", "LingObf", "cause", "emo_pos", "article", "prep", "quantity", "PaperType")]
# Generating descriptive statistics for variables of interest
continuous_descriptive_stats_all <- describe(continuous_data_for_descriptives)
# Displaying the results of the descriptive stats for the variables listed above
continuous_descriptive_stats_all
I won’t make too many comments here, because for most of these measures there are no formal or informal norms against which to judge them. That being said:
Year: The mean year is mid-2009, with a standard deviation of 10 years (max 2022, min 1980), indicating that the sample is recent enough to be relevant but also spans quite a number of years. I think this is good insofar as the recent history of academic publishing is represented more fully (some recent investigations limited their search to the three years prior to publication). There is some negative skew, which I suspect is due to the 1980 paper being quite a bit older than most papers.
Refs: The mean number of references is 45.82 with a standard deviation of 23.47. At least intuitively, this seems like a pretty standard distribution of references if we were to randomly select papers from the literature. However, the range is huge, with one paper showing 146 references—perhaps why the skew is positive. This may be something to consider when making group comparisons, as this point may have significant leverage.
WC: The average word count is 4436.25, and the standard deviation is 2248.24. It is also positively skewed, and although this should not affect our linguistic dependent variables it may be a confounding factor for references. Indeed, taking a look at the dataset I can see that the second highest value for word count (9670) also has the maximum value for references (the outlier noted above at 146 references). This will be something to watch out for.
Now we will create descriptive statistics for each of the continuous variables above within the PaperType groups.
# Making PaperType a factor variable in the continuous variable dataframe to allow for grouping
continuous_data_for_descriptives$PaperType <- factor(continuous_data_for_descriptives$PaperType, levels = c("SAFP", "MAFP", "SAGP", "MAGP"), labels = c("Single-Authored Fraudulent Papers", "Multi-Authored Fraudulent Papers", "Single-Authored Genuine Papers", "Multi-Authored Genuine Papers"))
# Generating descriptive statistics within PaperType groups
descriptive_stats_by_PaperType <- describeBy(continuous_data_for_descriptives, group = continuous_data_for_descriptives$PaperType)
descriptive_stats_by_PaperType
Year: The year seems to be relatively similar between groups, with the single-author groups (SAGP and SAFP) both having mid-2008 as the mean and the multi-author groups both having mid-2010 as the mean. This may be partially due to there being single outliers (matched papers that are both outliers) at the lower end of the distribution.
Refs: As I suspected, the outlier of 146 seems to be pulling the mean for SAGP up to 51.05 (the means for all other groups are around 44). This is corroborated by the skew statistics which show that only the SAGP group has a skew value above 1. I may run the analyses where references is the outcome variable both with and without this case in the model, or I may use word count as a covariate in the model to try to account for it.
WC: For some reason, there seems to be a notable difference in mean word count between the single-authored fraudulent group and the rest of the groups. I could speculate as to why this is the case, but I may do an unplanned follow-up analysis of this.
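The sensitivity check mentioned for the references outlier could look like the following (a hypothetical sketch on toy data; in the real analysis the dropped case would be the 146-reference SAGP paper):

```r
# Toy data standing in for the study dataset (values hypothetical)
data <- data.frame(Refs = c(30, 45, 52, 146),
                   PaperType = c("SAGP", "SAFP", "MAGP", "SAGP"))

# Drop the single high-leverage references case and rerun the model on the subset
data_no_outlier <- data[data$Refs != max(data$Refs), ]
nrow(data_no_outlier)  # one fewer row than the full dataset
```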
Frequencies will now be produced for categorical variables, both for the data in general and within PaperType groups.
First, we will make the variables inst_pres, gender, simple_reason, Country, and PaperType factor variables.
# Changing inst_pres (institutional prestige), gender, simple_reason, Country, and PaperType into factor variables
data$inst_pres <- factor(data$inst_pres, levels = c(0, 1), labels = c("Not Major Research Institution", "Major Research Institution"))
data$gender <- factor(data$gender, levels = c("FEMALE", "MALE"), labels = c("Female", "Male"))
data$simple_reason <- factor(data$simple_reason, levels = c("f_data", "f_image", "m_image", "f_data f_image"), labels = c("Fabricated/Falsified Data", "Fabricated/Falsified Image", "Manipulated Image", "Fabricated/Falsified Data and Image"))
data$Country <- factor(data$Country)
data$PaperType <- factor(data$PaperType)
Next, we will produce frequency tables for each categorical variable of interest for the whole dataset.
# Creating the frequency tables for each categorical variable
freq_tab_inst_pres <- table(data$inst_pres)
freq_tab_gender <- table(data$gender)
freq_tab_simple_reason <- table(data$simple_reason)
freq_tab_Country <- table(data$Country)
freq_tab_PaperType <- table(data$PaperType)
# Displaying the frequency tables
freq_tab_inst_pres
Not Major Research Institution Major Research Institution
70 18
freq_tab_gender
Female Male
15 73
freq_tab_simple_reason
Fabricated/Falsified Data Fabricated/Falsified Image
27 3
Manipulated Image Fabricated/Falsified Data and Image
12 2
freq_tab_Country
Australia Belgium China Egypt Ethiopia
1 1 10 3 2
India Iran Israel Italy Japan
10 1 1 1 3
Latvia Malaysia Netherlands Pakistan Poland
1 2 6 3 2
Portugal South Africa South Korea Taiwan Turkey
2 1 2 1 2
United Kingdom United States
4 29
freq_tab_PaperType
MAFP MAGP SAFP SAGP
22 22 22 22
Let’s take a look at the way these are split:
First, there seems to be a preponderance of lower status (not major) research institutions in the whole dataset (70:18).
Next, most of the papers were retracted due to fabricated/falsified data (n = 27), followed by those that manipulated images (n = 12), fabricated/falsified images (n = 3), or both fabricated/falsified data and images (n = 2).
For country, the most common category of papers seem to be from the United States (n = 29), followed by China and India (both n = 10) and the Netherlands (n = 6), with no other countries totaling more than n = 4.
Finally, we can see that the papers are evenly split between each of our groups (n = 22; N = 88).
Next, we will produce the frequencies within PaperType groups. Because we were not able to perfectly match across our matching characteristics, we will take this opportunity to look at how far off we were. A proportion table will be produced to more easily compare frequencies across PaperType groups. First, however, we need to make the matching dummy variables into factor variables with labels to make them more interpretable.
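The labeling step could look like the following (a hypothetical sketch for one dummy variable; the real dummies were coded before the data were loaded):

```r
# Toy data with one matching dummy variable (0 = matched, 1 = not matched)
data <- data.frame(different_country = c(0, 1, 0))

# Attach interpretable labels to the binary codes
data$different_country <- factor(data$different_country,
                                 levels = c(0, 1),
                                 labels = c("Same Country", "Different Country"))
table(data$different_country)
```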
I have checked the data set, and the labels seem to have been assigned correctly. Now we will produce the frequency tables and proportion tables.
Because we are taking a look at matching characteristics here, I will also take a look at the year_difference (the difference between the years of publication for the matched papers) variable within each group by using the describe() function from the psych package (like for the continuous variables above). In addition, I will take a look at the NumAuth (number of authors) for each of the groups.
# Frequency tables and proportion tables
# institutional prestige
frequency_table_by_PaperType_inst_pres <- table(data$PaperType, data$inst_pres)
frequency_table_by_PaperType_inst_pres
Not Major Research Institution Major Research Institution
MAFP 17 5
MAGP 18 4
SAFP 18 4
SAGP 17 5
Not Major Research Institution Major Research Institution
MAFP 0.7727273 0.2272727
MAGP 0.8181818 0.1818182
SAFP 0.8181818 0.1818182
SAGP 0.7727273 0.2272727
Australia Belgium China Egypt Ethiopia India
MAFP 0.00000000 0.00000000 0.18181818 0.00000000 0.00000000 0.13636364
MAGP 0.00000000 0.00000000 0.18181818 0.00000000 0.00000000 0.13636364
SAFP 0.00000000 0.04545455 0.09090909 0.09090909 0.04545455 0.18181818
SAGP 0.04545455 0.00000000 0.00000000 0.04545455 0.04545455 0.00000000
Iran Israel Italy Japan Latvia Malaysia
MAFP 0.00000000 0.00000000 0.00000000 0.04545455 0.00000000 0.04545455
MAGP 0.00000000 0.00000000 0.00000000 0.04545455 0.00000000 0.04545455
SAFP 0.00000000 0.00000000 0.00000000 0.04545455 0.00000000 0.00000000
SAGP 0.04545455 0.04545455 0.04545455 0.00000000 0.04545455 0.00000000
Netherlands Pakistan Poland Portugal South Africa South Korea
MAFP 0.09090909 0.04545455 0.04545455 0.04545455 0.00000000 0.00000000
MAGP 0.04545455 0.04545455 0.04545455 0.04545455 0.00000000 0.00000000
SAFP 0.04545455 0.00000000 0.00000000 0.00000000 0.00000000 0.04545455
SAGP 0.09090909 0.04545455 0.00000000 0.00000000 0.04545455 0.04545455
Taiwan Turkey United Kingdom United States
MAFP 0.00000000 0.00000000 0.00000000 0.36363636
MAGP 0.00000000 0.00000000 0.00000000 0.40909091
SAFP 0.00000000 0.04545455 0.04545455 0.31818182
SAGP 0.04545455 0.04545455 0.13636364 0.22727273
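The within-group proportion tables above can be reproduced from the corresponding frequency tables with prop.table(); a minimal sketch on toy counts (not the study data):

```r
# Toy frequency table: rows are PaperType groups, columns are a matched characteristic
toy_counts <- matrix(c(17, 5, 18, 4), nrow = 2, byrow = TRUE,
                     dimnames = list(c("MAFP", "MAGP"),
                                     c("Not Major Research Institution", "Major Research Institution")))
toy_table <- as.table(toy_counts)

# margin = 1 gives row-wise proportions, i.e., proportions within each group
prop.table(toy_table, margin = 1)
```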
# And for the matching characteristic dummy variables
# Country different
frequency_table_by_PaperType_different_country <- table(data$PaperType, data$different_country)
frequency_table_by_PaperType_different_country
Same Country Different Country
MAFP 14 8
MAGP 21 1
SAFP 0 0
SAGP 10 12
Same Journal Different Journal
MAFP 0.5 0.5
MAGP
SAFP
SAGP
# Descriptive statistics for year off within PaperType groups
year_difference_descriptives <- describeBy(data$year_difference, group = data$PaperType, na.rm = TRUE)
Warning in min(x, na.rm = na.rm): no non-missing arguments to min; returning
Inf
Warning in max(x, na.rm = na.rm): no non-missing arguments to max; returning
-Inf
year_difference_descriptives
Descriptive statistics by group
group: MAFP
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 22 -1.36 6.15 0 -0.22 1.48 -21 8 29 -1.9 3.45 1.31
------------------------------------------------------------
group: MAGP
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 22 -0.64 2.15 0 -0.22 0 -10 1 11 -3.72 13.37 0.46
------------------------------------------------------------
group: SAFP
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 0 NaN NA NA NaN NA Inf -Inf -Inf NA NA NA
------------------------------------------------------------
group: SAGP
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 22 0 1.35 0 -0.06 0 -4 4 8 0.11 4.36 0.29
num_auth_descriptives <- describeBy(data$NumAuth, group = data$PaperType, na.rm = TRUE)
num_auth_descriptives
Descriptive statistics by group
group: MAFP
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 22 3.73 1.32 4 3.61 1.48 2 7 5 0.6 -0.17 0.28
------------------------------------------------------------
group: MAGP
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 22 4.05 1.91 4 3.78 1.48 2 9 7 0.99 0.29 0.41
------------------------------------------------------------
group: SAFP
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 22 1 0 1 1 0 1 1 0 NaN NaN 0
------------------------------------------------------------
group: SAGP
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 22 1 0 1 1 0 1 1 0 NaN NaN 0
Overview of Categorical Variables:
Institutional Prestige: The split of frequencies/proportions is very close to even across groups. The MAFP and SAGP groups each have 17 papers from institutions that are not major research institutions and 5 from major research institutions (.77/.23), and the MAGP and SAFP groups each have 18 papers from non-major institutions and 4 from major research institutions (.82/.18).
Gender: The split between groups is also very similar for gender, with MAFP, MAGP, and SAGP groups each having 4 female first authors and 18 male first authors (.18/.82). SAFP had 3 female first authors and 19 male first authors (.14/.86).
Reason (Not Matched Between Groups During Data Collection): The reason for retraction is only relevant for the fraudulent paper groups, and there is a similar split between them. The SAFP group had 15 papers with Fabricated/Falsified Data, 2 papers with Fabricated/Falsified Images, 5 with Manipulated Images, and 0 that Fabricated/Falsified Data and Images (.68/.09/.23/.00). The MAFP group had 12 papers with Fabricated/Falsified Data, 1 paper with Fabricated/Falsified Images, 7 with Manipulated Images, and 2 with Fabricated/Falsified Data and Images (.55/.05/.32/.09).
Country: Because there are many countries, I will not delineate them all here but rather refer you to the data printed above if you are interested in diving deeper. To highlight one thing, we can see that papers from the United States (our most common country in the whole data set) tend to be overrepresented in the frequencies for each group (MAFP, 8/.36; MAGP, 9/.40; SAFP, 7/.32; SAGP, 5/.23). There are no other countries with proportions that exceed the lowest proportion of United States papers in any group (all<.23).
Overview of Matching Characteristic Dummy Variables and Average Differences:
Although the frequencies and proportions of the matched variables within groups above indicate to some degree how similar the groups are, the more important metric of how closely matched the groups are is between the groups for which comparisons are to be made. For this reason, before loading the data I dummy coded whether each paper matched its pair on each characteristic as a binary variable (which we labeled above). Because the SAFP papers were not gathered by matching to any group, there are no data for the SAFP group on any of the dummy variables, and journal was the only characteristic that failed to match across some of the SAFP and MAFP pairs. Each of the other groups is in reference to the group against which it will be compared (i.e., data for MAFP represent how well they match the SAFP papers, data for MAGP represent how well they match the MAFP papers, and data for SAGP represent how well they match the SAFP papers). In addition, I coded how different the year of publication was between matched papers, such that positive values indicate the number of years the original paper was published after the matched paper (negative values indicate the matched paper was published after the paper it was being matched to).
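The year-difference coding described above amounts to a simple subtraction (a sketch; the vector names are hypothetical):

```r
# Positive values = original paper published after its matched paper
original_year <- c(2012, 2005)
matched_year  <- c(2010, 2008)
year_difference <- original_year - matched_year
year_difference  # 2 (original newer) and -3 (matched paper newer)
```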
MAFP compared with SAFP:
Country was the same for 14 (.64) pairs and different for 8 (.36) pairs.
Gender was the same for 18 (.82) pairs and different for 4 (.18) pairs.
Institutional prestige was the same for 17 (.77) pairs and different for 5 (.23) pairs.
Journal was the same for 11 (.50) pairs and different for 11 (.50) pairs.
The average year difference was -1.36, indicating that the MAFP tended to be slightly more recent than the SAFP.
Finally, the mean number of authors for MAFP is 3.73.
MAGP compared with MAFP:
Country was the same for 21 (.95) pairs and different for 1 (.05) pairs.
Gender was the same for 20 (.91) pairs and different for 2 (.09) pairs.
Institutional prestige was the same for 21 pairs (.95) and different for 1 (.05) pair.
Journal was the same for every paper.
The average year difference was -.64, indicating that the MAGP tended to be slightly more recent than the MAFP.
Finally, the mean number of authors for MAGP is 4.05.
SAGP compared with SAFP:
Country was the same for 10 (.45) pairs and different for 12 (.55) pairs.
Gender was the same for 19 (.86) pairs and different for 3 (.14) pairs.
Institutional prestige was the same for 21 pairs (.95) and different for 1 (.05) pair.
Journal was the same for every paper.
The average year difference was 0, indicating that the SAGP and SAFP are exactly matched for the average year of publication.
Summary of matching characteristics:
Looking at the data, country was the most difficult characteristic to match across all categories of papers, except for MAFP compared with SAFP, where journal was the most difficult. Nevertheless, in many of the situations where we were unable to match a specific paper, this was balanced out by a mismatch in the opposite direction, as the frequencies within paper types demonstrate.
Data Visualization
Using the ggplot2 package, frequency distributions and box plots will be generated for LingObf, CertSent, and Refs. A bar plot will be produced for FraudCorrAuth.
library(ggplot2) # To load the ggplot2 package
Attaching package: 'ggplot2'
The following objects are masked from 'package:psych':
%+%, alpha
# To produce histogram and box plot for LingObf
ggplot(data, aes(x = LingObf)) +
  geom_histogram(fill = "blue", color = "black") +
  labs(title = "Histogram of Linguistic Obfuscation for Entire Dataset")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data, aes(y = LingObf)) +
  geom_boxplot(fill = "lightblue", color = "black") +
  labs(title = "Box Plot of Linguistic Obfuscation for Entire Dataset")
# To produce histogram and box plot for CertSent
ggplot(data, aes(x = CertSent)) +
  geom_histogram(fill = "blue", color = "black") +
  labs(title = "Histogram of Certainty Sentiment for Entire Dataset")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data, aes(y = CertSent)) +
  geom_boxplot(fill = "lightblue", color = "black") +
  labs(title = "Box Plot of Certainty Sentiment for Entire Dataset")
# To produce histogram and box plot for Refs
ggplot(data, aes(x = Refs)) +
  geom_histogram(fill = "blue", color = "black") +
  labs(title = "Histogram of References for Entire Dataset")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data, aes(y = Refs)) +
  geom_boxplot(fill = "lightblue", color = "black") +
  labs(title = "Box Plot of References for Entire Dataset")
# To produce bar plot for FraudCorrAuth
ggplot(data, aes(x = FraudCorrAuth)) +
  geom_bar(fill = "coral") +
  labs(title = "Bar Plot of Fraudulent Corresponding Author for MAFP")
Warning: Removed 80 rows containing non-finite outside the scale range
(`stat_count()`).
LingObf: Linguistic obfuscation seems to have a relatively normal distribution, with a somewhat strange jump in frequencies around 2.5 and a somewhat strange drop-off in frequencies right above zero (M = 0, SD = 2.04; from the descriptives above). The box plot shows a couple of potential outliers, but there are both high (2) and low (2) outliers, which may almost cancel each other out in terms of being problematic.
CertSent: Certainty sentiment does not seem to be skewed despite having outliers at the low end of the distribution. The box plot highlights three values at the low end of the scale as outliers.
Refs: References has a strange distribution, one that looks like it is actually composed of two latent classes. It also has two outlier values at the extreme high end of the distribution, as shown by the histogram and the box plot.
FraudCorrAuth: The fraudulent corresponding author variable shows an even split between papers for which the fraudulent author was and was not the corresponding author.
Using the ggplot2 package, frequency distributions and box plots will be generated for LingObf, CertSent, and Refs within PaperType groups.
# To produce histogram and box plot for LingObf
ggplot(data, aes(x = LingObf)) +
  geom_histogram(fill = "blue", color = "black") +
  facet_wrap(~ PaperType) +
  labs(title = "Histograms of Linguistic Obfuscation Within Paper Type Groups")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data, aes(y = LingObf)) +
  geom_boxplot(fill = "lightblue", color = "black") +
  facet_wrap(~ PaperType) +
  labs(title = "Box Plots of Linguistic Obfuscation Within Paper Type Groups")
# To produce histogram and box plot for CertSent
ggplot(data, aes(x = CertSent)) +
  geom_histogram(fill = "blue", color = "black") +
  facet_wrap(~ PaperType) +
  labs(title = "Histograms of Certainty Sentiment Within Paper Type Groups")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data, aes(y = CertSent)) +
  geom_boxplot(fill = "lightblue", color = "black") +
  facet_wrap(~ PaperType) +
  labs(title = "Box Plots of Certainty Sentiment Within Paper Type Groups")
# To produce histogram and box plot for Refs
ggplot(data, aes(x = Refs)) +
  geom_histogram(fill = "blue", color = "black") +
  facet_wrap(~ PaperType) +
  labs(title = "Histograms of References Within Paper Type Groups")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data, aes(y = Refs)) +
  geom_boxplot(fill = "lightblue", color = "black") +
  facet_wrap(~ PaperType) +
  labs(title = "Box Plots of References Within Paper Type Groups")
A few quick notes:
LingObf:
Outliers for linguistic obfuscation seem to be represented in each of the paper groups except for MAFP.
CertSent:
The outlier at the low end of the certainty sentiment distribution seems to belong to the MAGP group.
Refs:
The positive outliers we have noted a couple times now for references seem to be in the SAGP group, corroborating our speculation above (see Descriptive Statistics).
Strangely, we noted that the references seemed to have two latent classes, but, at least visually, this does not seem to represent any particular paper group, as the latent classes show up in the SAFP and MAGP groups (as well as perhaps the SAGP group). I am not sure why this is.
Overall, the strange values are not all within one group, which suggests that they are due to random variation rather than one paper that was not read appropriately by the text analysis programs (or some other form of systematic problem with one or two observations in the data set).
Bivariate Correlations
In order to further explore data characteristics, bivariate correlations between outcome variables will be produced. As these are only for exploring data and not testing predictions, assumptions will not be tested, and the relationships will not be probed for statistical significance. Then, correlations for subcomponents of composite variables will be produced.
First, Pearson bivariate correlations will be produced for LingObf, CertSent, and Refs.
# To produce correlation matrix for LingObf, CertSent, and Refs variables within whole dataset
# To calculate the Pearson correlation coefficients between the numerical outcome variables
correlation_matrix_num_outcomes <- cor(data[c("LingObf", "CertSent", "Refs")],
                                       use = "complete.obs", method = "pearson")
# To display the correlation matrix for numerical outcome variables
correlation_matrix_num_outcomes
All of the variables here show negligible bivariate relationships, and none of them are significant (I produced the correlation table from another script to see the significance).
Next, here are the point-biserial correlations between FraudCorrAuth and LingObf, CertSent, and Refs.
# Calculating point-biserial correlations between FraudCorrAuth and numerical outcome variables
pbcorr_FraudCorrAuth_LingObf <- cor(data$FraudCorrAuth, data$LingObf,
                                    use = "complete.obs", method = "pearson")
pbcorr_FraudCorrAuth_CertSent <- cor(data$FraudCorrAuth, data$CertSent,
                                     use = "complete.obs", method = "pearson")
pbcorr_FraudCorrAuth_Refs <- cor(data$FraudCorrAuth, data$Refs,
                                 use = "complete.obs", method = "pearson")
# Displaying the correlation coefficients
pbcorr_FraudCorrAuth_LingObf
[,1]
[1,] 0.5678883
pbcorr_FraudCorrAuth_CertSent
[1] 0.8266114
pbcorr_FraudCorrAuth_Refs
[1] -0.2157277
While the relationship between FraudCorrAuth and LingObf is large (r = .57) and the relationship between FraudCorrAuth and CertSent is very large (r = .83), I will refrain from interpreting these because the sample is severely underpowered with a mere n = 8 observations.
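For reference, the significance of a point-biserial correlation can be obtained with `cor.test()`, since it is simply a Pearson correlation with one dichotomous variable. A minimal sketch on simulated values (the data here are made up for illustration; only the variable names mirror ours, and n = 8 mirrors the subsample size):

```r
# Point-biserial correlation with a significance test, on simulated values
set.seed(42)
sim <- data.frame(
  FraudCorrAuth = rep(c(0, 1), each = 4),  # dichotomous variable
  LingObf       = rnorm(8)                 # continuous variable
)
pb_test <- cor.test(sim$FraudCorrAuth, sim$LingObf, method = "pearson")
pb_test$estimate  # the point-biserial r
pb_test$p.value   # its two-sided p-value
```

With only 8 observations, such a test has very little power, which is why the coefficients above are reported but not interpreted.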
Now, we will produce bivariate correlations for the subcomponents of the abstraction index (see Data Cleaning for more details).
# Calculate pairwise correlations among the subcomponents of the abstraction index
correlation_matrix_abstraction_subcomponents <- cor(data[, c("article", "prep", "quantity")])
# Display the correlation matrix
correlation_matrix_abstraction_subcomponents
The correlations here (.03 < r < .1) are quite a bit smaller (i.e., essentially nonexistent) than those Markowitz & Hancock (2016) found for their sample when they constructed the abstraction index (.176 < r < .263). This calls into question the reliability of the abstraction index as a subcomponent of the linguistic obfuscation index in our sample, as well as the validity of each of its indicators (articles, prepositions, and quantity words) in our sample.
Now we will produce Pearson bivariate correlations for the subcomponents of the LingObf index.
# Calculate pairwise correlations among the subcomponents of LingObf
correlation_matrix_LingObf_subcomponents <- cor(data[, c("cause", "abstraction", "jargon", "emo_pos", "flesch_re")])
# Display the correlation matrix
correlation_matrix_LingObf_subcomponents
If we let the Linguistic Obfuscation Index be O, causal terms be C, the abstraction index be A, jargon be J, positive emotion terms be P, and Flesch reading ease be F, the mathematical model for the construction of the Linguistic Obfuscation index for an observation i can be stated as follows:
\[
O_i = (C_i + A_i + J_i) - (P_i + F_i)
\]
If this model is internally consistent, we should expect the variables within the same aggregate (i.e., those inside the same parentheses) to be positively correlated. We should also expect variables in opposite aggregates to be negatively correlated with one another.
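As a concrete illustration of the index arithmetic, here are two hypothetical observations (the component scores are made up, and I assume the components are already on comparable scales):

```r
# Hypothetical, already-scaled component scores for two observations
cause       <- c(0.5, -0.2)   # C: causal terms
abstraction <- c(1.1,  0.3)   # A: abstraction index
jargon      <- c(0.2,  0.8)   # J: jargon
emo_pos     <- c(-0.4, 0.1)   # P: positive emotion terms
flesch_re   <- c(0.9, -0.6)   # F: Flesch reading ease

# O = (C + A + J) - (P + F)
LingObf <- (cause + abstraction + jargon) - (emo_pos + flesch_re)
LingObf  # 1.3 1.4
```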
Expected Positive Correlations:
We should expect the relationships between causal terms, the abstraction index, and jargon words to be positive. However, these relationships are all small and negative, with the exception of the relationship between abstraction and jargon, which is very slightly positive.
We also should expect the relationship between positive emotion words and Flesch reading ease to be positive, but it is also a small, negative correlation.
Expected Negative Correlations:
As we should expect, there is a moderate negative relationship between causal terms and Flesch reading ease.
However, we should expect the relationship between causal terms and positive emotion terms to be negative, but it is a small, positive correlation.
We should expect a negative relationship between the abstraction index and positive emotion terms, but there is no relationship to speak of.
We also should expect a negative relationship between the abstraction index and Flesch reading ease, but there is no relationship to speak of (very, very slightly positive).
As we should expect, there is a small negative relationship between jargon and positive emotion terms.
Lastly, we should expect a negative correlation between jargon and Flesch reading ease, but there is a small, positive correlation between them.
The observed relationships here may be expected a priori in some cases (e.g., the relationship between jargon and Flesch reading ease may be expected to be positive because the longer the sentences someone writes, the more likely they may be to use nonstandard words, simply by volume). However, from an internal consistency perspective, most of the relationships we have observed here (7/10) are either in the wrong direction or too small to be meaningful.
This warrants serious consideration for whether our findings for hypothesis A can be interpreted appropriately.
Testing Hypotheses
Hypothesis 1: Linguistic Obfuscation Hypothesis
Hypothesis 1 consists of two variants. The first, Hypothesis 1a, is the general linguistic obfuscation hypothesis tested by Markowitz & Hancock (2016), which we will test in order to replicate their previous findings. The second, Hypothesis 1b, is our specific variant of the linguistic obfuscation hypothesis, which states that fraudulent scientists obfuscate their writing to make their work less accessible to their research group, rather than to those outside their research group. This leads to the prediction that SAFP—which presumably are written without the presence of a research group—will use less linguistic obfuscation. Specifically, these hypotheses are stated as follows:
Hypothesis 1a: Fraudulent research will be written with more linguistic obfuscation than non-fraudulent research. [Replication]
Hypothesis 1b: Single-author fraudulent research will be written with more linguistic obfuscation than non-fraudulent research but less linguistic obfuscation than multi-author fraudulent research. [Novel]
Testing Hypothesis 1a
To test Hypothesis 1a, we will conduct an independent samples t-test comparing the mean levels of LingObf of fraudulent papers (SAFP and MAFP combined) with genuine papers (SAGP and MAGP combined). That is, we will see whether the mean linguistic obfuscation differs between fraudulent papers and genuine papers.
To compare fraudulent publications with genuine publications, we will first add a new dichotomous grouping variable for genuine or fraudulent papers (Genuine_or_Fraudulent). That is, the new variable will combine SAGP and MAGP into one category (GPaper) and SAFP and MAFP into another category (FPaper). This will be done using the dplyr package. Then we will make the variable a factor variable.
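The chunk performing this transformation is not echoed in the rendered output; an assumed implementation, sketched here on a toy data frame, might look like the following:

```r
library(dplyr)

# Toy data frame standing in for the study dataset
toy <- data.frame(PaperType = c("SAFP", "MAFP", "SAGP", "MAGP"))

# Collapse the four paper types into a dichotomous grouping variable,
# then convert it to a factor
toy <- toy %>%
  mutate(
    Genuine_or_Fraudulent = if_else(PaperType %in% c("SAFP", "MAFP"),
                                    "FPaper", "GPaper"),
    Genuine_or_Fraudulent = factor(Genuine_or_Fraudulent)
  )
toy$Genuine_or_Fraudulent  # FPaper FPaper GPaper GPaper
```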
This transformation also seems to have been done correctly.
To check the assumption of normality we will use the ggplot2 package to create Q-Q plots for LingObf within the FPaper and GPaper groups. The Q-Q plots will be inspected visually. Because each group has only n = 44 cases, our t-test will not necessarily be robust to violations of this assumption. If the assumption of normality is violated and we cannot fix it, then we will use a bootstrapped t-test to fit a more robust model.
# To produce Q-Q plots for LingObf within the FPaper and GPaper groups
ggplot(data, aes(sample = LingObf)) +
  stat_qq() +
  facet_wrap(~ Genuine_or_Fraudulent) +
  geom_qq_line()
The QQ plots are not that close to the theoretical values, with deviation at the ends of the distributions (particularly in the fraudulent paper group). Because of this, I will run a Shapiro-Wilk test on each of the groups.
# Loading the dplyr package for the pipe function
library(dplyr)
# Performing the Shapiro-Wilk test for normality within each group
sw_results_LingObf_GorP <- data %>%
  group_by(Genuine_or_Fraudulent) %>%
  summarise(SW_test_result = list(shapiro.test(LingObf)))
# Displaying the results
sw_results_LingObf_GorP$SW_test_result[[1]]
Shapiro-Wilk normality test
data: LingObf
W = 0.96576, p-value = 0.2128
sw_results_LingObf_GorP$SW_test_result[[2]]
Shapiro-Wilk normality test
data: LingObf
W = 0.97439, p-value = 0.4279
The Shapiro-Wilk tests indicate that for both groups we fail to reject the null hypothesis that the data are normally distributed. Our assumption of normality is therefore supported.
We will now run Levene’s test for homogeneity of variance.
# To load the car package for Levene's testlibrary(car)
Loading required package: carData
Attaching package: 'car'
The following object is masked from 'package:psych':
logit
The following object is masked from 'package:dplyr':
recode
# Converting LingObf to numeric again, because there is a bug that won't allow me to run it otherwise
data$LingObf <- as.numeric(data$LingObf)
# To conduct Levene's test for homogeneity of variances
hyp_1a_levene_test <- leveneTest(LingObf ~ Genuine_or_Fraudulent, data = data)
# To display the Levene's test result
hyp_1a_levene_test
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 1 0.9065 0.3437
86
Results of Levene’s test indicate that we fail to reject the null hypothesis that the variances are equal between the groups.
Now we will run the independent samples t-test.
# Load effsize package for Cohen's dlibrary(effsize)
Attaching package: 'effsize'
The following object is masked from 'package:psych':
cohen.d
# Performing the t-test
stud_t_test_obf <- t.test(LingObf ~ Genuine_or_Fraudulent, data = data)
# Calculating Cohen's d using the effsize package
d_obf <- cohen.d(LingObf ~ Genuine_or_Fraudulent, data = data)
# Reporting the results
cat("Mean Difference: ", stud_t_test_obf$estimate, "\n",
    "t-value: ", stud_t_test_obf$statistic, "\n",
    "p-value: ", stud_t_test_obf$p.value, "\n",
    "Cohen's d: ", d_obf$estimate, "\n")
The results of the t-test do not provide reliable evidence against the null for Hypothesis 1a. We fail to reject the null hypothesis that the means of the two groups are the same.
Testing Hypothesis 1b
To test Hypothesis 1b we will conduct a one-way analysis of variance (ANOVA) to compare group means of LingObf for SAFP, MAFP, SAGP, and MAGP. That is, we will determine whether SAFP contains more linguistic obfuscation than the genuine paper groups but less linguistic obfuscation than the MAFP.
To check the assumption of normality we will use the ggplot2 package to create Q-Q plots for LingObf using our original study dataset. The Q-Q plots will be inspected visually. If the assumption of normality is violated we will use a bootstrapped ANOVA to fit a more robust model.
# To produce Q-Q plots for LingObf within the PaperType groups
ggplot(data, aes(sample = LingObf)) +
  stat_qq() +
  facet_wrap(~ PaperType) +
  geom_qq_line()
Although they don’t look terrible, I am not satisfied with the plots within PaperType groups for SAFP, MAGP, or SAGP. Therefore, we will test for non-normality using a Shapiro-Wilk test within each group.
# Performing the Shapiro-Wilk test for normality within each group
sw_results_LingObf_PT <- data %>%
  group_by(PaperType) %>%
  summarise(SW_test_result2 = list(shapiro.test(LingObf)))
# Displaying the results
sw_results_LingObf_PT$SW_test_result2[[1]]
Shapiro-Wilk normality test
data: LingObf
W = 0.96423, p-value = 0.5792
sw_results_LingObf_PT$SW_test_result2[[2]]
Shapiro-Wilk normality test
data: LingObf
W = 0.93083, p-value = 0.1277
sw_results_LingObf_PT$SW_test_result2[[3]]
Shapiro-Wilk normality test
data: LingObf
W = 0.95552, p-value = 0.4043
sw_results_LingObf_PT$SW_test_result2[[4]]
Shapiro-Wilk normality test
data: LingObf
W = 0.95824, p-value = 0.4546
The results of the Shapiro-Wilk tests for normality indicate that we fail to reject the null hypothesis that the data are normally distributed in each group. We will now check the assumption of homogeneity of variances.
Running Levene’s test for homogeneity of variances for the PaperType groups.
# To conduct Levene's test for homogeneity of variances
hyp_1b_levene_test <- leveneTest(LingObf ~ PaperType, data = data)
# To display the Levene's test result
hyp_1b_levene_test
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 3 1.7857 0.1561
84
The result of Levene’s test indicates that we fail to reject the null hypothesis that the variances are equal between the groups. Therefore, we will consider the assumption of homogeneity supported and run our one-way ANOVA.
We will now run the one-way ANOVA.
# Perform and print the ANOVA
aov_obf <- aov(LingObf ~ PaperType, data = data)
summary(aov_obf)
Df Sum Sq Mean Sq F value Pr(>F)
PaperType 3 23.2 7.737 1.916 0.133
Residuals 84 339.2 4.038
The results of this one-way ANOVA indicate that we fail to reject the null hypothesis that there are no differences between the PaperType groups on the LingObf variable. No post-hoc test needs to be conducted.
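For reference, had the omnibus test been significant, a standard follow-up would have been Tukey's HSD on the fitted aov object. A sketch on simulated data (not the study data; group sizes mirror ours):

```r
# Simulated data with the same group structure (22 papers per type)
set.seed(1)
sim <- data.frame(
  PaperType = factor(rep(c("SAFP", "MAFP", "SAGP", "MAGP"), each = 22)),
  LingObf   = rnorm(88)
)
sim_aov <- aov(LingObf ~ PaperType, data = sim)
# All pairwise group comparisons with family-wise 95% confidence intervals
TukeyHSD(sim_aov)
```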
Hypothesis 2: References Hypothesis
Hypothesis 2 also consists of two variants. The first is that fraudulent research will contain more references than non-fraudulent research, functioning to make the research more costly to assess from outside readers (Markowitz & Hancock, 2016) or as an analogue to third-person pronoun usage in other linguistic studies of deception (Schmidt, 2022). The second is our adaptation of the first version which emphasizes the salience of the research group as the audience of this communicative style, similar to the logic of Hypothesis 1b. Specifically, the hypotheses are as follows:
Hypothesis 2a: Fraudulent research will contain more references than non-fraudulent research. [Replication]
Hypothesis 2b: Single-author fraudulent research will contain more references than non-fraudulent research but fewer references than multi-author fraudulent research. [Novel]
Testing Hypothesis 2a
To test Hypothesis 2a, we will compare the mean number of Refs between the FPaper and GPaper groups using an independent-samples t-test.
To check the assumption of normality we will use the ggplot2 package to create Q-Q plots for Refs within the Genuine_or_Fraudulent dichotomous variable. The Q-Q plots will be inspected visually. As references are count data rather than continuous data, there is a higher likelihood that they will violate the assumption of normality. If this is the case we will instead run a Mann-Whitney U test to compare the two groups by their median Refs.
# To produce Q-Q plots for Refs within the FPaper and GPaper groups
ggplot(data, aes(sample = Refs)) +
  stat_qq() +
  facet_wrap(~ Genuine_or_Fraudulent) +
  geom_qq_line()
The QQ-plots do not look promising, especially for genuine papers, which we noted earlier likely contained outlier values. We will test normality more formally with a Shapiro-Wilk test.
# Performing the Shapiro-Wilk test for normality within each group
sw_results_Refs_GorP <- data %>%
  group_by(Genuine_or_Fraudulent) %>%
  summarise(SW_test_result3 = list(shapiro.test(Refs)))
# Displaying the results
sw_results_Refs_GorP$SW_test_result3[[1]]
Shapiro-Wilk normality test
data: Refs
W = 0.92403, p-value = 0.006505
sw_results_Refs_GorP$SW_test_result3[[2]]
Shapiro-Wilk normality test
data: Refs
W = 0.86705, p-value = 0.0001223
For both groups, the Shapiro-Wilk test for normality indicates that we should reject the null hypothesis that the data are normally distributed. Therefore, we will move straight to a Mann-Whitney U test, skipping Levene’s test for homogeneity of variances.
Conducting the Mann-Whitney U test.
# To conduct Mann-Whitney U test
hyp_two_a_mann_test <- wilcox.test(Refs ~ Genuine_or_Fraudulent, data = data)
Warning in wilcox.test.default(x = DATA[[1L]], y = DATA[[2L]], ...): cannot
compute exact p-value with ties
# Displaying the results
hyp_two_a_mann_test
Wilcoxon rank sum test with continuity correction
data: Refs by Genuine_or_Fraudulent
W = 934.5, p-value = 0.7829
alternative hypothesis: true location shift is not equal to 0
The results of the Mann-Whitney U test indicate that we fail to reject the null hypothesis that the two groups do not differ in location. Although we mentioned in the data exploration section that we might want to remove the outlier before testing hypotheses with references as the outcome variable, the Mann-Whitney U test is rank-based rather than mean-based and should therefore not be biased by the outliers’ leverage. Therefore, it is not necessary to re-run this test with the outlier removed.
Testing Hypothesis 2b
To test Hypothesis 2b we will conduct a one-way analysis of variance (ANOVA) to compare group means of Refs for SAFP, MAFP, SAGP, and MAGP. That is, we will determine whether SAFP contain more references than the genuine paper groups but fewer references than MAFP. As references are count data rather than continuous data, there is a higher likelihood that they will violate the assumption of normality, in which case we will instead run a Kruskal-Wallis test to compare the groups by their median Refs.
To check the assumption of normality we will use the ggplot2 package to create Q-Q plots for Refs. The Q-Q plots will be investigated visually. If the assumption of normality is violated we will use the non-parametric Kruskal-Wallis test instead.
# To produce Q-Q plots for Refs within the PaperType groups
ggplot(data, aes(sample = Refs)) +
  stat_qq() +
  facet_wrap(~ PaperType) +
  geom_qq_line()
Again, the data do not look normally distributed based on the QQ-plots, especially for SAGP. To test this formally, we conduct Shapiro-Wilk tests for normality within groups.
# Performing the Shapiro-Wilk test for normality within each group
sw_results_Refs_PT <- data %>%
  group_by(PaperType) %>%
  summarise(SW_test_result4 = list(shapiro.test(Refs)))
# Displaying the results
sw_results_Refs_PT$SW_test_result4[[1]]
Shapiro-Wilk normality test
data: Refs
W = 0.92665, p-value = 0.1045
sw_results_Refs_PT$SW_test_result4[[2]]
Shapiro-Wilk normality test
data: Refs
W = 0.8939, p-value = 0.02249
sw_results_Refs_PT$SW_test_result4[[3]]
Shapiro-Wilk normality test
data: Refs
W = 0.9078, p-value = 0.0427
sw_results_Refs_PT$SW_test_result4[[4]]
Shapiro-Wilk normality test
data: Refs
W = 0.86365, p-value = 0.005977
Indeed, the Shapiro-Wilk tests (except for the MAFP group) indicate that we should reject the null hypothesis that the data are normally distributed. We will now skip past testing for homogeneity of variances and use the Kruskal-Wallis test as a robust alternative to a one-way ANOVA, testing whether there are any differences between group medians for references.
Now to conduct the Kruskal-Wallis test.
# To conduct Kruskal-Wallis test
hyp_two_b_krusk <- kruskal.test(Refs ~ PaperType, data = data)
# To print the results of the Kruskal-Wallis test
hyp_two_b_krusk
Kruskal-Wallis rank sum test
data: Refs by PaperType
Kruskal-Wallis chi-squared = 0.49984, df = 3, p-value = 0.9189
The Kruskal-Wallis test indicates that we fail to reject the null hypothesis that there are no differences in median Refs across groups. Similarly to Hypothesis 2a, this test should be robust to outliers, so there is no need to remove them here.
Hypothesis 3: Certainty
Hypothesis 3 investigates the use of certainty language in cases of scientific fraud. While a case study found that Diederik Stapel tended to use more certainty language in his fraudulent work (Markowitz & Hancock, 2014), others have found less certainty language in retracted papers than in non-retracted papers (Dehdarirad & Schirone, 2023). Thus, Hypothesis 3 is meant to provide clarity regarding the use of certainty language in scientific fraud. Specifically, it is stated as follows:
Hypothesis 3: Fraudulent research will contain less certainty than non-fraudulent research. [Replication]
Testing Hypothesis 3:
To test Hypothesis 3, we will compare the mean CertSent score between the FPaper and GPaper groups using an independent-samples t-test.
To check the assumption of normality we will use the ggplot2 package to create Q-Q plots for CertSent within the FPaper and GPaper groups. The Q-Q plots will be investigated visually. If the assumption of normality is violated we will use a bootstrapped t-test to fit a more robust model.
# To produce Q-Q plots for CertSent within the FPaper and GPaper groups
ggplot(data, aes(sample = CertSent)) +
  stat_qq() +
  facet_wrap(~ Genuine_or_Fraudulent) +
  geom_qq_line()
The QQ-plots seem to diverge from normality in the middle of the distribution (for fraudulent papers) and at the lower end of the distribution (for genuine papers). I will assess non-normality more formally using Shapiro-Wilk tests within groups.
# Performing the Shapiro-Wilk test for normality within each group
sw_results_CertSent_GorF <- data %>%
  group_by(Genuine_or_Fraudulent) %>%
  summarise(SW_test_result5 = list(shapiro.test(CertSent)))
# Displaying the results
sw_results_CertSent_GorF$SW_test_result5[[1]]
Shapiro-Wilk normality test
data: CertSent
W = 0.98578, p-value = 0.8563
sw_results_CertSent_GorF$SW_test_result5[[2]]
Shapiro-Wilk normality test
data: CertSent
W = 0.96977, p-value = 0.2971
The results from the Shapiro-Wilk tests for normality indicate that we fail to reject the null hypothesis that the data are normally distributed within groups.
Thus, we may assume normality for our t-test.
To check the assumption of homogeneity of variances, we will conduct Levene’s test using the car package.
# To load the car package for Levene's test
library(car)
# To conduct Levene's test for homogeneity of variances
hyp_three_levene_test <- leveneTest(CertSent ~ Genuine_or_Fraudulent, data = data)
# To display the Levene's test result
hyp_three_levene_test
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 1 4.0182 0.04816 *
86
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Levene’s test for homogeneity of variance indicates that we should reject the null hypothesis that the variances of the two groups are equal. Thus, we will run a Welch’s t-test, which does not assume equal variances.
Conducting the Welch’s t-test.
# To conduct Welch's t-test
hyp_three_welch_t_test <- t.test(CertSent ~ Genuine_or_Fraudulent, data = data, var.equal = FALSE)
# To print the Welch's t-test result
hyp_three_welch_t_test
Welch Two Sample t-test
data: CertSent by Genuine_or_Fraudulent
t = 1.6329, df = 73.47, p-value = 0.1068
alternative hypothesis: true difference in means between group Fraudulent Papers and group Genuine Papers is not equal to 0
95 percent confidence interval:
-0.0193144 0.1945871
sample estimates:
mean in group Fraudulent Papers mean in group Genuine Papers
6.213636 6.126000
Contrary to our hypothesis, Fraudulent Papers actually have higher mean certainty sentiment than Genuine Papers, although this result is not significant.
Hypothesis 4: Corresponding Author Hypothesis
Hypothesis 4 is a logical extension of the notion that to hide information you must first have control over related information flow. Based on this principle, Hypothesis 4 is formulated as follows:
Hypothesis 4: For multi-author fraudulent papers, the fraudulent author will be the corresponding author more frequently than other authors. [Novel]
Hypothesis 4 rests on the following two assumptions:
First, all authors on a paper are equally likely to be the corresponding author. Although this assumption oversimplifies norms of assigning scientists to be corresponding authors, in the absence of more information regarding, for example, author order, author responsibilities, or laboratory status, it is the most accurate prediction we can make regarding the likelihood of an author being the corresponding author. That is, in the absence of other information it is the baseline prediction.
Second, if fraudulent authors do not attempt to control information flow, they are equally likely to be the corresponding author as their coauthors are. This may also oversimplify the norms of assigning scientists to be corresponding authors. For example, it is possible that fraudulent authors perform more data management, and that scientists responsible for data management are more likely to be corresponding authors. However, given the absence of this information (e.g., regarding research group norms), this assumption is reasonable.
These two assumptions allow us to make a prediction regarding the baseline expected frequency that fraudulent authors will be the corresponding authors of fraudulent research papers if they do not attempt to control information flow.
Testing Hypothesis 4:
To test Hypothesis 4 we will conduct a binomial test to compare the observed likelihood that the fraudulent author is the corresponding author (i.e., FraudCorrAuth) to the expected probability given the assumption of equal likelihood for all authors and the average number of authors on each MAFP.
To attain the expected probability we will calculate the inverse of the mean number of authors on all MAFP where the corresponding author is known. To do this, we will first filter out the other PaperType categories and the MAFP cases where the corresponding author is unknown (i.e., FraudCorrAuth has a missing value) to make a new data frame, hypothesis_four_df. Then, we will calculate the mean NumAuth within this data frame and take its inverse.
# To create data frame that only includes MAFP with a score for FraudCorrAuth
hypothesis_four_df <- data[data$PaperType == 'MAFP' & data$FraudCorrAuth %in% c("0", "1"), ]
# To calculate the mean number of authors within the data frame
mean_MAFP_NumAuth_corrauth_known <- mean(hypothesis_four_df$NumAuth)
# To display the mean value
mean_MAFP_NumAuth_corrauth_known
[1] 3.125
# To calculate the inverse of mean_MAFP_NumAuth_corrauth_known
expected_prob_FraudCorrAuth <- 1 / mean_MAFP_NumAuth_corrauth_known
# To display the expected probability
expected_prob_FraudCorrAuth
[1] 0.32
The mean number of authors in this subsample is 3.125, and the expected likelihood that the fraudulent author would be the corresponding author is therefore .32.
Next, we will conduct a binomial test to compare the observed likelihood that a corresponding author is the fraudulent author to the expected probability (expected_prob_FraudCorrAuth).
# To run the binomial test with the expected probability within the hypothesis four data frame
hyp_four_binomial_test <- binom.test(sum(hypothesis_four_df$FraudCorrAuth),
                                     nrow(hypothesis_four_df),
                                     p = expected_prob_FraudCorrAuth,
                                     alternative = "two.sided")
# To print the binomial test results
hyp_four_binomial_test
Exact binomial test
data: sum(hypothesis_four_df$FraudCorrAuth) and nrow(hypothesis_four_df)
number of successes = 4, number of trials = 8, p-value = 0.2776
alternative hypothesis: true probability of success is not equal to 0.32
95 percent confidence interval:
0.1570128 0.8429872
sample estimates:
probability of success
0.5
Although our observed likelihood is slightly higher than our expected probability, the hypothesis test indicates that we fail to reject the null hypothesis that there is no difference between them.
Of course, this test has very low power, with only 8 trials.
Saving Data
Now I will quickly export the main data frame to a csv file.
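The export chunk is not echoed in the rendered output; an assumed implementation with base R's `write.csv()` (demonstrated here on a toy data frame standing in for the study dataset):

```r
# Toy data frame standing in for the post-analysis dataset
toy <- data.frame(DOI = c("10.0/a", "10.0/b"), Refs = c(30, 45))

# Write to CSV without the row-name column
write.csv(toy, "post_analysis_study_dataset.csv", row.names = FALSE)
```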
The data is now available as “post_analysis_study_dataset.csv”.
Follow Up Analysis
This portion of the analysis was not preregistered. However, as Dr. Arbeau points out, the word count of papers may be a confound for our analyses assessing Refs between groups. To assess this possibility, we will create a new variable for the ratio of references to word count in a paper, and we will use this as our outcome variable to re-test Hypothesis 2a and Hypothesis 2b. Although both the Lexical Suite and Linguistic Inquiry and Word Count (LIWC) produced word count variables in our dataset, we will use the word count from LIWC (i.e., WC), as this was the word count variable used for the descriptive statistics.
Retesting Hypothesis 2a
To retest Hypothesis 2a, we will first compute a new variable RefWordRatio by dividing Refs by WC. Then we will compare the mean RefWordRatio between the FPaper and GPaper groups using an independent-samples t-test.
Computing the RefWordRatio.
data$RefWordRatio <- (data$Refs/data$WC)
Computing descriptive statistics for RefWordRatio both in the whole dataset and between FPaper and GPaper groups.
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 88 0.01 0 0.01 0.01 0 0 0.03 0.02 0.82 0.13 0
RefWordRatio_ds_by_GorF <- describeBy(data$RefWordRatio, group = data$Genuine_or_Fraudulent)
RefWordRatio_ds_by_GorF
Descriptive statistics by group
group: Fraudulent Papers
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 44 0.01 0 0.01 0.01 0 0 0.02 0.02 0.57 -0.59 0
------------------------------------------------------------
group: Genuine Papers
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 44 0.01 0.01 0.01 0.01 0 0 0.03 0.02 1 0.49 0
In whole dataset:
Interestingly, the skew statistic is lower for the RefWordRatio variable compared to the Refs variable alone (see Descriptive Statistics).
Within groups:
The skew statistic seems to be higher in the genuine paper group. We will assess this with QQ-plots.
Testing assumptions:
First, we will assess normality with QQ-plots within groups.
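The plotting chunk is not echoed in the rendered output; presumably it followed the pattern of the earlier normality checks, along the lines of the following sketch (shown with toy data standing in for the study data frame):

```r
library(ggplot2)

# Toy data standing in for the study data frame
set.seed(2)
toy <- data.frame(
  RefWordRatio          = runif(20, 0, 0.03),
  Genuine_or_Fraudulent = rep(c("FPaper", "GPaper"), each = 10)
)

# Q-Q plots for RefWordRatio within the FPaper and GPaper groups
p <- ggplot(toy, aes(sample = RefWordRatio)) +
  stat_qq() +
  geom_qq_line() +
  facet_wrap(~ Genuine_or_Fraudulent)
p
```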
The points in both groups do not sit very tightly on the theoretical line. We should take a look at the Shapiro-Wilk test.
# Performing the Shapiro-Wilk test for normality within each group
sw_results_RefWordRatio_GorP <- data %>%
  group_by(Genuine_or_Fraudulent) %>%
  summarise(SW_test_result6 = list(shapiro.test(RefWordRatio)))
# Displaying the results
sw_results_RefWordRatio_GorP$SW_test_result6[[1]]
Shapiro-Wilk normality test
data: RefWordRatio
W = 0.94859, p-value = 0.04877
sw_results_RefWordRatio_GorP$SW_test_result6[[2]]
Shapiro-Wilk normality test
data: RefWordRatio
W = 0.91456, p-value = 0.003147
The results of the Shapiro-Wilk tests indicate that we should reject the null hypothesis that the data are normally distributed in both groups. Therefore, we will conduct a bootstrapped t-test to compare the two groups using an empirically determined, robust confidence interval. This will also allow us to skip the test for homogeneity of variance.
Conducting the bootstrapped t-test.
# Loading the boot package for bootstrapping
library(boot)
Attaching package: 'boot'
The following object is masked from 'package:car':
logit
The following object is masked from 'package:psych':
logit
# Setting the seed
set.seed(616913)
# Defining the statistic function for bootstrapping
t_stat_boot_RefWCRatio <- function(data, index) {
  # Resample the data with replacement
  resampled_data_RefWCRatio <- data[index, ]
  t_result <- t.test(RefWordRatio ~ Genuine_or_Fraudulent, data = resampled_data_RefWCRatio)
  return(t_result$statistic)
}
# Performing the bootstrapping
boot_results_RefWCRatio <- boot(data = data, statistic = t_stat_boot_RefWCRatio, R = 1000)
# Performing the regular t-test
stud_t_test_RefWCRatio <- t.test(RefWordRatio ~ Genuine_or_Fraudulent, data = data)
# Calculating Cohen's d using the effsize package
d_RefWCRatio <- cohen.d(RefWordRatio ~ Genuine_or_Fraudulent, data = data)
# Calculate the 95% confidence interval from the bootstrapped results
boot_ci_RefWCRatio <- boot.ci(boot_results_RefWCRatio, type = "bca")
# Report the results
cat("Mean Difference: ", stud_t_test_RefWCRatio$estimate, "\n",
    "95% CI: ", boot_ci_RefWCRatio$bca[4], boot_ci_RefWCRatio$bca[5], "\n",
    "t-value: ", stud_t_test_RefWCRatio$statistic, "\n",
    "p-value: ", stud_t_test_RefWCRatio$p.value, "\n",
    "Cohen's d: ", d_RefWCRatio$estimate, "\n")
The results of the t-test with bootstrapped confidence intervals indicate that we fail to reject the null hypothesis that there is no difference between the two groups.
Retesting Hypothesis 2b
To retest Hypothesis 2b, we will compare the mean RefWordRatio between the PaperType groups using an analysis of variance, after computing descriptive statistics within PaperType groups.
Computing descriptive statistics within PaperType groups.
RefWordRatio_ds_by_PaperType <- describeBy(data$RefWordRatio, group = data$PaperType)
RefWordRatio_ds_by_PaperType
Descriptive statistics by group
group: MAFP
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 22 0.01 0 0.01 0.01 0 0 0.02 0.02 0.68 -0.47 0
------------------------------------------------------------
group: MAGP
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 22 0.01 0 0.01 0.01 0 0.01 0.02 0.01 0.77 -0.12 0
------------------------------------------------------------
group: SAFP
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 22 0.01 0 0.01 0.01 0.01 0.01 0.02 0.02 0.47 -0.86 0
------------------------------------------------------------
group: SAGP
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 22 0.01 0.01 0.01 0.01 0.01 0 0.03 0.02 0.75 -0.55 0
There is not much to say about these descriptives; the values are so small at this scale (means and medians all round to 0.01) that they are difficult to assess.
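One way to make descriptives like these more readable (a sketch, assuming RefWordRatio is the count of references divided by word count, as implied by the "ratio of Refs to WC" below; the values here are simulated, not the thesis data) is to rescale the ratio to references per 1,000 words before summarizing:

```r
# Sketch: rescale a references-per-word ratio to references per 1,000 words
# so that summaries are legible. Simulated values stand in for data$RefWordRatio.
set.seed(1)
RefWordRatio <- runif(88, min = 0, max = 0.03) # ratios on the scale seen above

RefsPer1000 <- RefWordRatio * 1000 # e.g., a ratio of 0.012 becomes 12 per 1,000 words

summary(RefsPer1000)
```

The rescaling is a linear transformation, so it changes none of the test statistics below; it only makes the group means, SDs, and ranges easier to compare by eye.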
Testing assumptions:
First, we will assess normality with QQ-plots within groups.
# To produce Q-Q plots for RefWordRatio within the PaperType groups
ggplot(data, aes(sample = RefWordRatio)) +
  stat_qq() +
  facet_wrap(~ PaperType) +
  geom_qq_line()
The SAGP group appears to deviate from normality towards the bottom of the distribution, so we will perform the Shapiro-Wilk test for normality within PaperType groups.
# Performing the Shapiro-Wilk test for normality within each group
sw_results_RefWordRatio_PT <- data %>%
  group_by(PaperType) %>%
  summarise(SW_test_result7 = list(shapiro.test(RefWordRatio)))

# Displaying the results
sw_results_RefWordRatio_PT$SW_test_result7[[1]]
Shapiro-Wilk normality test
data: RefWordRatio
W = 0.93499, p-value = 0.1558
sw_results_RefWordRatio_PT$SW_test_result7[[2]]
Shapiro-Wilk normality test
data: RefWordRatio
W = 0.93047, p-value = 0.1255
sw_results_RefWordRatio_PT$SW_test_result7[[3]]
Shapiro-Wilk normality test
data: RefWordRatio
W = 0.95489, p-value = 0.3933
sw_results_RefWordRatio_PT$SW_test_result7[[4]]
Shapiro-Wilk normality test
data: RefWordRatio
W = 0.91003, p-value = 0.04742
Although for three of the four groups we fail to reject the null hypothesis of normally distributed data, the result for the SAGP group indicates that we should reject it. Therefore, we will skip the test for homogeneity of variance and conduct a one-way ANOVA with bootstrapped confidence intervals for the F-statistic.
Conducting the bootstrapped ANOVA.
# Setting the seed
set.seed(616913)

# Define the statistic function for bootstrapping ANOVA
f_stat_boot_RefWCRatio <- function(data, index) {
  resampled_data <- data[index, ] # Resample the data with replacement
  aov_result <- aov(RefWordRatio ~ PaperType, data = resampled_data)
  return(summary(aov_result)[[1]]$F[1]) # Extract the F-statistic
}

# Perform the bootstrapping
boot_results_f_RefWCRatio <- boot(data = data, statistic = f_stat_boot_RefWCRatio, R = 1000)

# Perform and print the original ANOVA to get the original F-statistic and p-value
original_aov_RefWCRatio <- aov(RefWordRatio ~ PaperType, data = data)
summary(original_aov_RefWCRatio)
Df Sum Sq Mean Sq F value Pr(>F)
PaperType 3 4.42e-05 1.473e-05 0.598 0.618
Residuals 84 2.07e-03 2.464e-05
# Calculate and report the bootstrapped confidence interval for the F-statistic
boot_f_ci_RefWCRatio <- boot.ci(boot_results_f_RefWCRatio, type = "bca")
Warning in norm.inter(t, adj.alpha): extreme order statistics used as endpoints
cat("Bootstrapped 95% Confidence Interval for F-statistic: ", boot_f_ci_RefWCRatio$bca[4], " to ", boot_f_ci_RefWCRatio$bca[5], "\n")
Bootstrapped 95% Confidence Interval for F-statistic: 0.00650461 to 1.600616
The results of the ANOVA, and the fact that the observed F-statistic falls within the bootstrapped confidence interval, suggest that we fail to reject the null hypothesis that there is no difference between the means of the PaperType groups on the ratio of Refs to WC.
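The warning above ("extreme order statistics used as endpoints") typically means the number of bootstrap resamples is too small for stable BCa endpoints. A sketch of the usual remedy, increasing R, is shown below on simulated data with a simple mean as the statistic (the analysis above would use the F-statistic and the thesis dataset instead):

```r
# Sketch: BCa interval endpoints stabilize as the number of resamples grows.
library(boot)

set.seed(616913)
x <- data.frame(y = rnorm(88)) # simulated stand-in for the thesis data
mean_boot <- function(data, index) mean(data$y[index])

# Few resamples: BCa endpoints may rely on extreme order statistics
boot_small <- boot(data = x, statistic = mean_boot, R = 1000)
ci_small <- boot.ci(boot_small, type = "bca")

# More resamples: endpoints are interpolated from a denser bootstrap distribution
boot_large <- boot(data = x, statistic = mean_boot, R = 10000)
ci_large <- boot.ci(boot_large, type = "bca")

ci_small$bca[4:5]
ci_large$bca[4:5]
```

Rerunning the ANOVA bootstrap with a larger R (e.g., 10,000) would be a cheap robustness check on the interval reported above.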
Retesting for a Main Effect of Author Number on Linguistic Obfuscation
Hypothesis 2a looks for differences between paper groups on linguistic obfuscation, but it looks like there may be a main effect of author number. I want to look at a comparison between multi-author paper groups and single-author paper groups to see if this is the case.
# Creating a new variable that isolates multi-author and single-author papers
data$S_or_M <- ifelse(data$PaperType %in% c("MAFP", "MAGP"), "MPaper", "SPaper")

# Making the new variable a factor with labels denoting what they represent
data$S_or_M <- factor(data$S_or_M, levels = c("SPaper", "MPaper"),
                      labels = c("Single-Author Papers", "Multi-Author Papers"))
Now I will check assumptions for a t-test to compare these groups on linguistic obfuscation.
Checking the assumption of normality with a QQ-plot:
# To produce Q-Q plots for LingObf within the SPaper and MPaper groups
ggplot(data, aes(sample = LingObf)) +
  stat_qq() +
  facet_wrap(~ S_or_M) +
  geom_qq_line()
The single-author papers appear to have some outliers at the top end of the distribution, so I will perform a Shapiro-Wilk test for normality.
# Performing the Shapiro-Wilk test for normality within each group
sw_results_LingObf_SorM <- data %>%
  group_by(S_or_M) %>%
  summarise(SW_test_resultf = list(shapiro.test(LingObf)))

# Displaying the results
sw_results_LingObf_SorM$SW_test_resultf[[1]]
Shapiro-Wilk normality test
data: LingObf
W = 0.97786, p-value = 0.5503
sw_results_LingObf_SorM$SW_test_resultf[[2]]
Shapiro-Wilk normality test
data: LingObf
W = 0.98037, p-value = 0.6484
The results of the Shapiro-Wilk test indicate that, in both groups, we fail to reject the null hypothesis that the data are normally distributed.
Now we will test for the assumption of homogeneity of variance with Levene’s test.
# Conducting and saving Levene's test
follow_up_SorM_levene <- leveneTest(LingObf ~ S_or_M, data = data)

# Displaying the results
follow_up_SorM_levene
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 1 6.1546 0.01505 *
86
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Results of Levene’s test indicate that the variances are not equal between the groups, so the assumption of homogeneity of variance is violated and we will proceed with Welch’s t-test.
# Conducting Welch's t-test and saving the result
follow_up_SorM_welch <- t.test(LingObf ~ S_or_M, data = data, var.equal = FALSE)

# Calculating the effect size for the difference
cohen.d(LingObf ~ S_or_M, data = data, var.equal = FALSE)
Cohen's d
d estimate: -0.3708706 (small)
95 percent confidence interval:
lower upper
-0.79832754 0.05658634
# Displaying the result
follow_up_SorM_welch
Welch Two Sample t-test
data: LingObf by S_or_M
t = -1.7395, df = 71.477, p-value = 0.08624
alternative hypothesis: true difference in means between group Single-Author Papers and group Multi-Author Papers is not equal to 0
95 percent confidence interval:
-1.6058955 0.1093361
sample estimates:
mean in group Single-Author Papers mean in group Multi-Author Papers
-0.3741399 0.3741399
The results of the t-test indicate that we fail to reject the null hypothesis that the groups do not differ, but the result is closer to significance than the result for Hypothesis 1a. Similarly, the effect size is larger than for Hypothesis 1a, suggesting that if there is a main effect of author number, it is larger than the difference between the fraudulent and genuine groups.
This is an interesting result, and it suggests that previous work comparing fraudulent and genuine papers may have been confounded by author number.
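A more direct way to separate these effects would be a 2 x 2 factorial ANOVA crossing author number with fraud status, which estimates the main effect of author number while accounting for fraud status and also tests their interaction. The sketch below uses simulated data in place of the thesis dataset; the variable names AuthorNum and FraudStatus are hypothetical, derived from PaperType the same way S_or_M was derived above:

```r
# Sketch: a 2 x 2 factorial ANOVA crossing author number with fraud status.
# Simulated data stand in for the thesis dataset (22 papers per PaperType group).
set.seed(616913)
PaperType <- factor(rep(c("SAFP", "SAGP", "MAFP", "MAGP"), each = 22))
LingObf <- rnorm(88) # simulated stand-in for the obfuscation scores

# Hypothetical factors derived from PaperType, as with S_or_M above
AuthorNum <- factor(ifelse(PaperType %in% c("MAFP", "MAGP"),
                           "Multi-Author", "Single-Author"))
FraudStatus <- factor(ifelse(PaperType %in% c("SAFP", "MAFP"),
                             "Fraudulent", "Genuine"))

# Main effects of author number and fraud status, plus their interaction
two_way_aov <- aov(LingObf ~ AuthorNum * FraudStatus)
summary(two_way_aov)
```

If the interaction term were significant, it would mean the fraud effect on linguistic obfuscation differs between single- and multi-author papers, which would be directly relevant to the possible confound noted above.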