Coding With Fun
Home Docker Django Node.js Articles Python pip guide FAQ Policy

6 scenarios where the URL goes heavy! (With detailed code)


May 31, 2021 Article blog


Table of contents


This article is from the public number: Java Chinese community, author: Lei brother

URLs are often encountered in our daily work and interviews, such as these:

 6 scenarios where the URL goes heavy! (With detailed code)1

As can be seen, including Ali, NetEase Cloud, Youku, job help and other well-known Internet companies have appeared similar interview questions, and the URL to re-compare similar, such as IP black/white list judgment is also often in our work, so we come to this article "one disk" URL to heavy issues.

URL to re-think

Without considering the business scenario and the amount of data, we can use the following scenarios to implement duplicate judgment of the URL:

  1. Use Java's Set collection to determine whether the URL is duplicate based on the results when added (adding successfully means that the URL is not duplicate);
  2. Use the Set collection in Redis to determine if the URL is duplicate based on the results when added;
  3. Store the URLs in the database and then use SQL statements to determine if there are duplicate URLs;
  4. Set a column of URLs in the database to a unique index to determine whether the URL is duplicate based on the results when added;
  5. Use Guava's Blom filter to achieve URL weighting;
  6. Use Redis' Blom filter to achieve URL weighting.

The implementation of the above scenarios is as follows.

Url to re-implement the scenario

1. Use Java's Set collection to weigh

Set collections are inherently non-repeatable and can only be used to store elements with different values, which can fail if added with the same values, so we can determine whether the URL is duplicate by adding the results of Set collection, as follows:

public class URLRepeat {
    // 待去重 URL
    public static final String[] URLS = {
            "www.apigo.cn",
            "www.baidu.com",
            "www.apigo.cn"
    };
    public static void main(String[] args) {
        Set<String> set = new HashSet();
        for (int i = 0; i < URLS.length; i++) {
            String url = URLS[i];
            boolean result = set.add(url);
            if (!result) {
                // 重复的 URL
                System.out.println("URL 已存在了:" + url);
            }
        }
    }
}

The result of the program's execution is:

The URL already exists: www.apigo.cn

As you can see from the above results, you can use Set collections to implement the weight-weighting function of the URL.

2. Redis Set Collection Goes Heavy

The implementation idea of using Redis's Set collection is consistent with the Set collection idea in Java, both of which take advantage of Set non-repeatability, and let's first use Redis' client redis-cli to implement an example of URL weighting:

 6 scenarios where the URL goes heavy! (With detailed code)2

As you can see from the above results, when the addition is successful, the URL is not duplicated, but when the addition fails (resulting in 0) it indicates that the URL already exists.

Let's implement Redis's Set de-re-emphasis in a code-like way, and the code is as follows:

// 待去重 URL
public static final String[] URLS = {
    "www.apigo.cn",
    "www.baidu.com",
    "www.apigo.cn"
};


@Autowired
RedisTemplate redisTemplate;


@RequestMapping("/url")
public void urlRepeat() {
    for (int i = 0; i < URLS.length; i++) {
        String url = URLS[i];
        Long result = redisTemplate.opsForSet().add("urlrepeat", url);
        if (result == 0) {
            // 重复的 URL
            System.out.println("URL 已存在了:" + url);
        }
    }
}

The results of the above procedures are:

The URL already exists: www.apigo.cn

In the above code, we used RedisTemplate in Spring Data to implement the RedisTemplate object in the Spring Boot project we need to first introduce spring-boot-starter-data-redis framework, the configuration information is as follows:

<!-- 添加操作 RedisTemplate 引用 -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-data-redis</artifactId>
</dependency>

You then need to configure Redis' connection information in the project again, as follows in application.properties

spring.redis.host=127.0.0.1
spring.redis.port=6379
#spring.redis.password=123456 # Redis 服务器密码,有密码的话需要配置此项

After these two steps, we can normally use RedisTemplate object to manipulate Redis in Spring Boot's projects.

 6 scenarios where the URL goes heavy! (With detailed code)3

3. The database goes heavy

We can also use the database to implement duplicate judgment of the URL, first we design a storage table for the URL, as shown in the following image:

 6 scenarios where the URL goes heavy! (With detailed code)4

The SQL for this table is as follows:

/*==============================================================*/
/* Table: urlinfo                                               */
/*==============================================================*/
create table urlinfo
(
   id                   int not null auto_increment,
   url                  varchar(1000),
   ctime                date,
   del                  boolean,
   primary key (id)
);


/*==============================================================*/
/* Index: Index_url                                             */
/*==============================================================*/
create index Index_url on urlinfo
(
   url
);

Where id is the primary key for self-increasing, and url field is set to index, the index is set to speed up queries.

Let's add two test data to the database, as shown in the following image:

 6 scenarios where the URL goes heavy! (With detailed code)5

Let's use SQL statement queries, as shown in the following image:

 6 scenarios where the URL goes heavy! (With detailed code)6

If the result is greater than 0, there is already a duplicate URL, otherwise there is no duplicate URL.

4. Unique index de-heavy

We can also use the database's unique index to prevent URL duplication, which is very similar to the idea of the previous Set collection.

First we set a unique index for the field URL, and then we add URL data, which, if successful, means that the URL is not duplicated, and vice versa.

The SQL implementation for creating a unique index is as follows:

create unique index Index_url on urlinfo
(
   url
);

5. Guava Blom filter goes heavy

The Bloom Filter was proposed by Blom in 1970. I t is actually a long binary vector and a series of random mapping functions. T he Blom filter can be used to retrieve whether an element is in a collection. Its advantages are that spatial efficiency and query time far exceed the general algorithm, the disadvantage is that there is a certain rate of misidentification and deletion difficulties.

The core implementation of the Blom filter is an oversized set of bits and several hash functions, assuming that the length of the number of bit groups is m and the number of hash functions is k.

 6 scenarios where the URL goes heavy! (With detailed code)7

As an example, the above diagram shows the specific operation flow: suppose there are 3 elements in the collection, the number of hash functions is 3. S tart by initializing the number of bit groups, setting bit 0 for each bit inside. F or each element in the collection, the elements are mapped in turn through three hash functions, each of which produces a hash value that corresponds to a point above the number of bits group, and then marks the corresponding position of the number of bits group as 1, the same method maps W through the hash to 3 points on the bit group when querying for the existence of the W element in the collection. I f one of the three points is not 1, you can tell that the element must not exist in the collection. C onversely, if all three points are 1, the element may exist in the collection. N ote: It is not possible to determine whether the element must exist in the collection and there may be a certain rate of misjudgment. A s you can see from the diagram, suppose an element is mapped to the three points below 4, 5, and 6. Although all three points are 1, it is clear that these three points are where different elements are hashed, so this situation indicates that the elements, although not in the collection, may also correspond to 1, which is why the misjudgment rate exists.

We can operate the Blom filter with the Guava framework provided by Google, so that we can first add a reference to Guava in pom.xml configured as follows:

<!-- 添加 Guava 框架 -->
<!-- https://mvnrepository.com/artifact/com.google.guava/guava -->
<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>28.2-jre</version>
</dependency>

Implementation code for URL weight:

public class URLRepeat {
    // 待去重 URL
    public static final String[] URLS = {
            "www.apigo.cn",
            "www.baidu.com",
            "www.apigo.cn"
    };


    public static void main(String[] args) {
        // 创建一个布隆过滤器
        BloomFilter<String> filter = BloomFilter.create(
                Funnels.stringFunnel(Charset.defaultCharset()),
                10, // 期望处理的元素数量
                0.01); // 期望的误报概率
        for (int i = 0; i < URLS.length; i++) {
            String url = URLS[i];
            if (filter.mightContain(url)) {
                // 用重复的 URL
                System.out.println("URL 已存在了:" + url);
            } else {
                // 将 URL 存储在布隆过滤器中
                filter.put(url);
            }
        }
    }
}

The results of the above procedures are:

The URL already exists: www.apigo.cn

6. Redis Blom filter to go heavy

In addition to Guava Blom filter, we can also use Redis' Blom filter to achieve URL weighting. Before we use it, we want to make sure that the Redis server version is greater than 4.0 (this version and above) and that the Redis Blom filter feature is turned on for proper use.

In Docker case, let's demonstrate the Redis Blom filter installation and opening, first downloading redis' Blom passer, and then turning on the Blum filter when restarting the Redis service, as shown in the following image:

 6 scenarios where the URL goes heavy! (With detailed code)8

Blom filter use: After the Blom filter is turned on normally, we first use Redis' client redis-cli to implement the Blom filter URL weighting, the implementation command is as follows:

 6 scenarios where the URL goes heavy! (With detailed code)9

In Redis, the Blom filter does not have many commands, mainly the following:

  • bf.add elements;
  • bf.exists to determine whether an element exists;
  • bf.madd adds multiple elements;
  • bf.mexists determine the existence of multiple elements;
  • bf.reserve sets the accuracy of the Blom filter.

Let's use code to demonstrate the use of Redis Blom filters:

import redis.clients.jedis.Jedis;
import utils.JedisUtils;


import java.util.Arrays;


public class BloomExample {
    // 布隆过滤器 key
    private static final String _KEY = "URLREPEAT_KEY";

    
    // 待去重 URL
    public static final String[] URLS = {
            "www.apigo.cn",
            "www.baidu.com",
            "www.apigo.cn"
    };


    public static void main(String[] args) {
        Jedis jedis = JedisUtils.getJedis();
         for (int i = 0; i < URLS.length; i++) {
            String url = URLS[i];
            boolean exists = bfExists(jedis, _KEY, url);
            if (exists) {
                // 重复的 URL
                System.out.println("URL 已存在了:" + url);
            } else {
                bfAdd(jedis, _KEY, url);
            }
        }
    }


    /**
     * 添加元素
     * @param jedis Redis 客户端
     * @param key   key
     * @param value value
     * @return boolean
     */
    public static boolean bfAdd(Jedis jedis, String key, String value) {
        String luaStr = "return redis.call('bf.add', KEYS[1], KEYS[2])";
        Object result = jedis.eval(luaStr, Arrays.asList(key, value),
                Arrays.asList());
        if (result.equals(1L)) {
            return true;
        }
        return false;
    }


    /**
     * 查询元素是否存在
     * @param jedis Redis 客户端
     * @param key   key
     * @param value value
     * @return boolean
     */
    public static boolean bfExists(Jedis jedis, String key, String value) {
        String luaStr = "return redis.call('bf.exists', KEYS[1], KEYS[2])";
        Object result = jedis.eval(luaStr, Arrays.asList(key, value),
                Arrays.asList());
        if (result.equals(1L)) {
            return true;
        }
        return false;
    }
}

The results of the above procedures are:

The URL already exists: www.apigo.cn

summary

This article describes six URL de-heavy scenarios, four of which are Redis Set, Redis Blom filters, databases, and unique indexes for distributed systems, and in the case of mass distributed systems, it is recommended to use Redis Blom filters to achieve URL de-weighting, and if it is a single-machine mass of data, Guava's blondware is recommended to implement URL de-weighting.

Here are W3Cschool编程狮 about URL de-weighting! ( with detailed code). Related to the introduction, I hope to help you.