This is based on my personal experience working on MongoDB with a Java app.
My Java app is a Spring Boot app that uses Spring Data MongoDB to connect to the database. Here is the Maven dependency I used.
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-data-mongodb</artifactId>
</dependency>
Before jumping into the problem, let me briefly describe the indexes we are using here.
Indexes
Indexes are added to a collection so that queries filtering on the indexed fields run faster and more efficiently.
Unique index: This is built on a field that must be unique for each document you save in the collection. Such a field can be a candidate key, a field you could use to uniquely identify a document. The index also acts as validation at the database level, so you cannot insert two documents with the same value for that field.
Background index: This is about when the index entry is created. Normally, when a collection (comparable to a table in a relational DB) has an index, an index entry is written at the same moment a new record is added. With a background index, the indexing does not happen at the same time as the write; it happens in the background and gives priority to recording the actual data. Therefore, while the index is being built in the background, read-write operations tend to be much faster because they are not blocked until the index is complete. By default, MongoDB builds an index in the foreground, which blocks operations, unless you explicitly request a background build. (By the way, in MongoDB 4.2 and newer this distinction has changed: all index builds use a single optimized build process, so the time difference between background and foreground indexing largely disappears.)
I must say I'm not an expert in MongoDB's internals, but I believe this explanation makes sense.
In the Java representation, you can configure the indexes as shown below.
import org.springframework.data.annotation.Id;
import org.springframework.data.mongodb.core.index.Indexed;
import org.springframework.data.mongodb.core.mapping.Document;

@Document(collection = "items")
public class Item {

    @Id
    private String id;

    @Indexed(unique = true, background = true)
    private String serialNumber;

    private String name;

    // rest of the code
}
We have a Java entity called Item (just an example representation of anything; it could be an item in an order list or similar). It has an id, annotated with @Id, which is the primary key of the document. We also have a serialNumber field annotated with @Indexed(unique = true, background = true), which declares a unique index built in the background.
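One caveat worth noting (this depends on the Spring Data MongoDB version you run; treat the property below as an assumption to verify against your version): since Spring Data MongoDB 3.0, automatic index creation from annotations is disabled by default, so for the @Indexed annotation to actually create the index at startup you may need to enable it in application.properties:

```properties
spring.data.mongodb.auto-index-creation=true
```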
Problem scenario
Let's say my collection holds an extremely large amount of data, say several million records. Building an index on such a collection will take some time (depending on the DB server capacity, of course), especially when the insert rate is also high. In this kind of scenario, I would run the index in the background, because I don't want to make my transactions slower and give my application a slow response time.
Let's say my transactions don't enforce uniqueness themselves; I'm relying on the DB unique index instead. However, my index is being built in the background. Therefore, one of my transactions can insert a duplicate record with the same serial number into the DB, because the unique index is not completely built yet.
So now I have duplicate records: since the index is not complete, it allowed me to insert duplicates of existing records. When the background indexing process reaches a duplicated record, it throws a duplicate key error. Since the index is declared in the Java entity, the application will see an exception like this:
org.springframework.dao.DuplicateKeyException: Write failed with error code 11000 and error message 'E11000 duplicate key error collection: <some db name>.items index: serialNumber dup key
If such a duplicate anomaly exists in the data, any new instance of the Java app that declares the unique index will fail to start, because the index creation attempted at startup fails on the duplicates.
Solution
If you need to create a unique index, don't run it in the background. If you need the background build anyway, make sure that you don't have duplicates in your data source. This can also be enforced at the code level, by checking for an existing record before inserting. Otherwise, you need a mechanism to remove duplicates from the source data before building the index.
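As a sketch of that "remove duplicates from the source data" step, here is a minimal plain-Java example (no Spring; the Item record and dedupeBySerialNumber method are hypothetical names, not part of the original app): it keeps the first occurrence of each serial number and drops the rest, so the data is clean before the unique index is built.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class Dedup {

    // Minimal stand-in for the Item entity; only the fields needed here.
    record Item(String id, String serialNumber) {}

    // Keep the first item seen for each serial number, dropping later duplicates.
    static List<Item> dedupeBySerialNumber(List<Item> items) {
        Map<String, Item> seen = new LinkedHashMap<>();
        for (Item item : items) {
            seen.putIfAbsent(item.serialNumber(), item);
        }
        return new ArrayList<>(seen.values());
    }

    public static void main(String[] args) {
        List<Item> items = List.of(
                new Item("1", "SN-100"),
                new Item("2", "SN-200"),
                new Item("3", "SN-100")); // duplicate serial number
        System.out.println(dedupeBySerialNumber(items).size()); // prints 2
    }
}
```

In a real app you would run the equivalent cleanup against the collection itself (or check for an existing serial number before each insert), but the idea is the same: deduplicate first, then create the unique index.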