May 31, 2021 · Blog article
This article is from the WeChat official account "Java Chinese Community"; author: Lei Ge.
URL deduplication questions come up often in daily work and in interviews. Well-known Internet companies including Alibaba, NetEase Cloud, Youku, and Zuoyebang have all asked similar interview questions, and problems analogous to URL deduplication, such as IP black/whitelist checks, also appear frequently in day-to-day work. So in this article we will work through the URL deduplication problem from end to end.
Setting aside the business scenario and the data volume, we can use the following approaches to implement URL duplicate detection. The implementation of each approach is described below.
A Set collection is inherently duplicate-free: it can only store distinct values, and adding a value that is already present fails. We can therefore use the return value of Set#add to determine whether a URL is a duplicate, as follows:
import java.util.HashSet;
import java.util.Set;

public class URLRepeat {
    // URLs to deduplicate
    public static final String[] URLS = {
            "www.apigo.cn",
            "www.baidu.com",
            "www.apigo.cn"
    };

    public static void main(String[] args) {
        Set<String> set = new HashSet<>();
        for (String url : URLS) {
            boolean result = set.add(url);
            if (!result) {
                // duplicate URL
                System.out.println("The URL already exists: " + url);
            }
        }
    }
}
The result of the program's execution is:
The URL already exists: www.apigo.cn
As the result above shows, a Set collection can be used to implement URL deduplication.
The idea behind using Redis's Set collection is the same as using Java's Set collection: both take advantage of a Set's uniqueness. Let's first use Redis's client, redis-cli, to implement a URL deduplication example:
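The original screenshot is not reproduced here; a redis-cli session along these lines (the key name urlrepeat matches the one used in the code later) shows sadd returning 1 on success and 0 when the member already exists:

```shell
127.0.0.1:6379> sadd urlrepeat www.apigo.cn
(integer) 1
127.0.0.1:6379> sadd urlrepeat www.baidu.com
(integer) 1
127.0.0.1:6379> sadd urlrepeat www.apigo.cn
(integer) 0
```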
When the addition succeeds (sadd returns 1), the URL is not a duplicate; when the addition fails (sadd returns 0), the URL already exists.
Next, let's implement Redis Set deduplication in code:
// URLs to deduplicate
public static final String[] URLS = {
        "www.apigo.cn",
        "www.baidu.com",
        "www.apigo.cn"
};

@Autowired
RedisTemplate redisTemplate;

@RequestMapping("/url")
public void urlRepeat() {
    for (String url : URLS) {
        Long result = redisTemplate.opsForSet().add("urlrepeat", url);
        if (result != null && result == 0) {
            // duplicate URL
            System.out.println("The URL already exists: " + url);
        }
    }
}
The result of running the program is:
The URL already exists: www.apigo.cn
In the code above we used the RedisTemplate object provided by Spring Data. To use RedisTemplate in a Spring Boot project, we first need to add the spring-boot-starter-data-redis dependency, configured as follows:
<!-- 添加操作 RedisTemplate 引用 -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-data-redis</artifactId>
</dependency>
Then we need to configure Redis's connection information in application.properties, as follows:
spring.redis.host=127.0.0.1
spring.redis.port=6379
#spring.redis.password=123456 # Redis server password; configure this if the server requires one
After these two steps, we can use the RedisTemplate object to operate Redis normally in a Spring Boot project.
We can also use a database to implement URL duplicate detection. First, we design a table to store the URLs.
The SQL for this table is as follows:
/*==============================================================*/
/* Table: urlinfo */
/*==============================================================*/
create table urlinfo
(
id int not null auto_increment,
url varchar(1000),
ctime date,
del boolean,
primary key (id)
);
/*==============================================================*/
/* Index: Index_url */
/*==============================================================*/
create index Index_url on urlinfo
(
url
);
Here id is an auto-increment primary key, and the url field has an index on it to speed up queries.
Next, we add two rows of test data to the database.
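The original screenshot is not reproduced here; the two test rows can be added with statements along these lines (the URL values are an assumption based on the rest of the article):

```sql
-- Assumed test rows; the original article showed them in a screenshot
insert into urlinfo (url, ctime, del) values ('www.apigo.cn', curdate(), false);
insert into urlinfo (url, ctime, del) values ('www.baidu.com', curdate(), false);
```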
Then we can check for duplicates with a SQL count query.
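The original screenshot is not reproduced here; a count query along these lines works (the URL value is illustrative):

```sql
select count(1) from urlinfo where url = 'www.apigo.cn';
```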
If the result is greater than 0, the URL already exists in the table; otherwise it does not.
We can also use a database unique index to prevent URL duplication; the idea is very similar to the Set collection approach above.
First we add a unique index on the url field; then, when inserting URL data, a successful insert means the URL is not a duplicate, and a failed insert means it is.
The SQL to create the unique index is as follows (note that the ordinary Index_url index created earlier must be dropped first, since index names are unique within a table):
create unique index Index_url on urlinfo
(
url
);
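With the unique index in place, inserting the same URL twice fails on the second attempt, which is how the duplicate is detected (values are illustrative):

```sql
insert into urlinfo (url, ctime, del) values ('www.apigo.cn', curdate(), false); -- succeeds
insert into urlinfo (url, ctime, del) values ('www.apigo.cn', curdate(), false); -- fails with a duplicate-entry error
```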
The Bloom filter was proposed by Burton Howard Bloom in 1970. It is essentially a long binary vector plus a series of random mapping functions, and it can be used to test whether an element is in a set. Its advantage is that its space efficiency and query time far exceed those of general-purpose algorithms; its disadvantages are a certain false-positive rate and difficulty deleting elements.
The core of a Bloom filter is a very large bit array and several hash functions. Suppose the length of the bit array is m and the number of hash functions is k.
Taking the diagram above as an example, the operation flow is as follows. Suppose the set contains 3 elements and the number of hash functions is 3. First, initialize the bit array, setting every bit to 0. For each element in the set, map the element through the three hash functions in turn; each function produces a hash value corresponding to a position in the bit array, and that position is then set to 1. When querying whether an element W is in the set, the same method maps W through the hashes to 3 positions in the bit array. If any of the three positions is not 1, the element definitely is not in the set. Conversely, if all three positions are 1, the element may be in the set. Note: it cannot be determined that the element definitely is in the set; there is a certain false-positive rate. As the diagram shows, suppose an element maps to positions 4, 5, and 6. Even though all three are 1, those bits may have been set by different elements' hashes, so an element that is not in the set can still map entirely to 1s. This is why the false-positive rate exists.
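The flow above can be sketched in a few lines of Java. This is a toy implementation for illustration only; the class name, bit-array size, and double-hashing scheme are all made up for the example, and real implementations such as Guava's or Redis's are more sophisticated:

```java
import java.util.BitSet;

public class SimpleBloomFilter {
    private final BitSet bits;   // the m-bit array, initialized to all zeros
    private final int numBits;   // m
    private final int numHashes; // k

    public SimpleBloomFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // Derive the i-th of k positions from two base hashes (double hashing)
    private int position(String value, int i) {
        int h1 = value.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x9E3779B9;
        return Math.floorMod(h1 + i * h2, numBits);
    }

    // Set the k mapped bits to 1
    public void add(String value) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(value, i));
        }
    }

    // If any mapped bit is 0, the element is definitely absent;
    // if all are 1, it is only *possibly* present (false positives happen)
    public boolean mightContain(String value) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(value, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        SimpleBloomFilter filter = new SimpleBloomFilter(1024, 3);
        filter.add("www.apigo.cn");
        System.out.println(filter.mightContain("www.apigo.cn"));  // true: all mapped bits were set by add()
        System.out.println(filter.mightContain("www.baidu.com")); // almost certainly false, though a false positive is possible
    }
}
```

Note that mightContain can never return false for an element that was added (no false negatives), which is the property the deduplication code below relies on.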
We can operate a Bloom filter through Google's Guava framework. First, add a Guava reference to pom.xml, configured as follows:
<!-- 添加 Guava 框架 -->
<!-- https://mvnrepository.com/artifact/com.google.guava/guava -->
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<version>28.2-jre</version>
</dependency>
The URL deduplication code is as follows:
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.Charset;

public class URLRepeat {
    // URLs to deduplicate
    public static final String[] URLS = {
            "www.apigo.cn",
            "www.baidu.com",
            "www.apigo.cn"
    };

    public static void main(String[] args) {
        // Create a Bloom filter
        BloomFilter<String> filter = BloomFilter.create(
                Funnels.stringFunnel(Charset.defaultCharset()),
                10,    // expected number of insertions
                0.01); // desired false-positive probability
        for (String url : URLS) {
            if (filter.mightContain(url)) {
                // duplicate URL
                System.out.println("The URL already exists: " + url);
            } else {
                // store the URL in the Bloom filter
                filter.put(url);
            }
        }
    }
}
The result of running the program is:
The URL already exists: www.apigo.cn
Besides Guava's Bloom filter, we can also use Redis's Bloom filter to implement URL deduplication.
Before using it, we need to make sure the Redis server version is 4.0 or above and that the Redis Bloom filter module is enabled.
Taking Docker as an example, let's demonstrate installing and enabling the Redis Bloom filter: first download the Redis Bloom filter module, then enable the Bloom filter when restarting the Redis service:
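The original screenshot is not reproduced here. One way to do this with Docker is to run an image that bundles Redis with the RedisBloom module (the image name and tag below are an assumption; check Docker Hub for the currently recommended image):

```shell
# Run Redis with the RedisBloom module preloaded (image name is an assumption)
docker run -d -p 6379:6379 --name redis-bloom redislabs/rebloom:latest
# Alternatively, an existing Redis can load the module at startup:
# redis-server --loadmodule /path/to/redisbloom.so
```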
Using the Bloom filter:
Once the Bloom filter is enabled, let's first use Redis's client, redis-cli, to implement Bloom filter URL deduplication. The commands are as follows:
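The original screenshot is not reproduced here; a redis-cli session along these lines (the key name urlrepeat is illustrative) shows bf.add returning 1 for a new element and 0 for one that already exists:

```shell
127.0.0.1:6379> bf.add urlrepeat www.apigo.cn
(integer) 1
127.0.0.1:6379> bf.add urlrepeat www.baidu.com
(integer) 1
127.0.0.1:6379> bf.add urlrepeat www.apigo.cn
(integer) 0
```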
Redis's Bloom filter does not have many commands; the main ones are the following:
- bf.add adds a single element;
- bf.exists checks whether an element exists;
- bf.madd adds multiple elements;
- bf.mexists checks whether multiple elements exist;
- bf.reserve sets the accuracy of the Bloom filter.
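For example, bf.reserve takes a key, a desired error rate, and an expected capacity, and must be called before the first bf.add on that key (the key name and numbers below are illustrative):

```shell
127.0.0.1:6379> bf.reserve urlfilter 0.01 10000
OK
```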
Let's demonstrate the use of the Redis Bloom filter in code:
import redis.clients.jedis.Jedis;
import utils.JedisUtils;
import java.util.Collections;

public class BloomExample {
    // Bloom filter key
    private static final String _KEY = "URLREPEAT_KEY";
    // URLs to deduplicate
    public static final String[] URLS = {
            "www.apigo.cn",
            "www.baidu.com",
            "www.apigo.cn"
    };

    public static void main(String[] args) {
        Jedis jedis = JedisUtils.getJedis();
        for (String url : URLS) {
            boolean exists = bfExists(jedis, _KEY, url);
            if (exists) {
                // duplicate URL
                System.out.println("The URL already exists: " + url);
            } else {
                bfAdd(jedis, _KEY, url);
            }
        }
    }

    /**
     * Add an element
     * @param jedis Redis client
     * @param key   key
     * @param value value
     * @return boolean
     */
    public static boolean bfAdd(Jedis jedis, String key, String value) {
        // pass the value via ARGV rather than KEYS, since only key names belong in KEYS
        String luaStr = "return redis.call('bf.add', KEYS[1], ARGV[1])";
        Object result = jedis.eval(luaStr, Collections.singletonList(key),
                Collections.singletonList(value));
        return result.equals(1L);
    }

    /**
     * Check whether an element exists
     * @param jedis Redis client
     * @param key   key
     * @param value value
     * @return boolean
     */
    public static boolean bfExists(Jedis jedis, String key, String value) {
        String luaStr = "return redis.call('bf.exists', KEYS[1], ARGV[1])";
        Object result = jedis.eval(luaStr, Collections.singletonList(key),
                Collections.singletonList(value));
        return result.equals(1L);
    }
}
The result of running the program is:
The URL already exists: www.apigo.cn
This article described six URL deduplication approaches: Java's Set collection, Redis's Set collection, a database count query, a database unique index, Guava's Bloom filter, and Redis's Bloom filter. Four of them (Redis Set, Redis Bloom filter, database query, unique index) are suitable for distributed systems. For massive data in a distributed system, Redis's Bloom filter is recommended for URL deduplication; for massive data on a single machine, Guava's Bloom filter is recommended.
That concludes W3Cschool编程狮's introduction to URL deduplication (with detailed code); I hope it helps.