Hello, and welcome back. In this blog post, I would like to share my week 8 progress on Google Summer of Code with openSUSE, where I am working on the rpmlint project.

As mentioned in the previous blog post, I started to work on DuplicatesCheck.py. Here is a detailed overview of my progress.

Progress [########........]

For this week, I decided to focus on DuplicatesCheck.py. However, I soon realized that the tests required some additional capabilities of the FakePkg class.

Here is the current interface of FakePkg we use to create a mockPkg:

def get_tested_mock_package(files=None, real_files=None, header=None):
    mockPkg = FakePkg('mockPkg')
    if files is not None:
        mockPkg.create_files(files, real_files)
    if header is not None:
        mockPkg.add_header(header)
    return mockPkg

The header argument alone doesn’t supply everything that some tests need. Certain tests depend on additional per-file information that cannot be passed directly through the header parameter.

For the current discussion of DuplicatesCheck, the test function I am considering is test_unexpanded_macros in the file test_files.py. This test needs what are called MD5 hash values: the hashes of the files contained in a binary RPM file.

To learn more about MD5, follow this wiki link.
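
As a quick illustration of why hashes are handy for spotting duplicates, here is a minimal sketch using Python’s standard hashlib module. The helper name md5_hex and the byte strings are made up for this example; they are not taken from rpmlint or from the test package.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of `data` as a hexadecimal string."""
    return hashlib.md5(data).hexdigest()


# Made-up file contents, for illustration only.
first = b"same content\n"
second = b"same content\n"
third = b"different content\n"

# Identical contents give identical digests; different contents do not.
same = md5_hex(first) == md5_hex(second)
differs = md5_hex(first) != md5_hex(third)
print(same, differs)
```

Comparing short digests is much cheaper than comparing every pair of files byte by byte, which is why a duplicate check can simply group files by digest.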

For example, consider the test function:

@pytest.mark.parametrize('package', ['binary/duplicates'])
def test_duplicates(tmp_path, package, duplicatescheck):
    output, test = duplicatescheck
    test.check(get_tested_package(package, tmp_path))
    out = output.print_results(output.results)

    assert 'E: hardlink-across-partition /var/foo /etc/foo' in out
    assert 'E: hardlink-across-config-files /var/foo2 /etc/foo2' in out
    assert 'W: files-duplicate /etc/bar3 /etc/bar:/etc/bar2' in out
    assert 'W: files-duplicate /etc/strace2.txt /etc/strace1.txt' in out
    assert 'W: files-duplicate /etc/small2 /etc/small' not in out
    assert 'E: files-duplicated-waste 270544' in out

Here, the binary package being checked is test/binary/duplicates-0-0.x86_64.rpm. A hash is generated for every file in it, and files that share a hash are treated as duplicates. The sample output below shows the rpm command that lists all the files in the binary RPM:

$ rpm -qlp test/binary/duplicates-0-0.x86_64.rpm

/etc/bar
/etc/bar2
/etc/bar3
/etc/foo
/etc/foo2
/etc/small
/etc/small2
/etc/strace1.txt
/etc/strace2.txt
/var/foo
/var/foo2

While the test was running under pytest, I captured the hash values computed for these files. Because the package contains duplicate files, several files produce the same hash, so the hashes are stored in a key-value data structure (a dictionary) that maps each hash to the set of files that share it. See the sample output below:

 ==> (Pdb) p md5s

    {
    'b3ab937fbdc55ae7bf96749074e816056f0605491d419f9f5b97dc00c8c04aae': 
        {
            <rpmlint.pkgfile.PkgFile object at 0x7f69c1189430>,
            <rpmlint.pkgfile.PkgFile object at 0x7f69c1189640>,
            <rpmlint.pkgfile.PkgFile object at 0x7f69c11894e0>
        },
    'bc1a4e47244cdf6b4c4735453cf55503a995334f1735458ab2e3c01455e159e3': 
        {
            <rpmlint.pkgfile.PkgFile object at 0x7f69c118a090>,
            <rpmlint.pkgfile.PkgFile object at 0x7f69c11897a0>
        },
    'e18b816e748b2366af6cb7281bcf8fca7f65be8a41e456e562f6acd8b267fc32': 
        {
            <rpmlint.pkgfile.PkgFile object at 0x7f69c1189900>,
            <rpmlint.pkgfile.PkgFile object at 0x7f69c118a1f0>
        },
    '1618780f802ed0571225ec155527f82a0eaa540d16983c387baede6208ced745': 
        {
            <rpmlint.pkgfile.PkgFile object at 0x7f69c1189f30>,
            <rpmlint.pkgfile.PkgFile object at 0x7f69c1189dd0>
        }
    }

As shown, 9 files were hashed, and some of them share the same hash value. The rpm query, however, lists 11 files in total. The difference arises because rpmlint ignores very small files: the minimum file size is defined in the configuration file, and every file below that limit is skipped. Note also that each key is 64 hexadecimal characters long, which is longer than the 32 characters of an MD5 digest, so the stored values appear to come from a longer digest such as SHA-256 even though the variable is named md5s.
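
To make the grouping and the size cut-off concrete, here is a rough sketch of the idea, not the actual DuplicatesCheck.py code. The function name group_by_digest, the MIN_SIZE value, and the file contents are all assumptions made for this example; it uses MD5 only to match the terminology above, and the real limit comes from rpmlint’s configuration.

```python
import hashlib
from collections import defaultdict

# Hypothetical threshold for this sketch; rpmlint reads its own limit
# from the configuration file.
MIN_SIZE = 10


def group_by_digest(files, min_size=MIN_SIZE):
    """Map each digest to the set of paths whose content produced it.

    `files` maps a path to its content as bytes (a stand-in for the
    files of a package). Files smaller than `min_size` are skipped.
    """
    groups = defaultdict(set)
    for path, content in files.items():
        if len(content) < min_size:
            continue  # too small to be worth reporting
        groups[hashlib.md5(content).hexdigest()].add(path)
    return dict(groups)


# Paths borrowed from the post; the contents are made up.
sample_files = {
    "/etc/bar": b"A" * 20,
    "/etc/bar2": b"A" * 20,
    "/etc/foo": b"B" * 20,
    "/etc/small": b"x",
    "/etc/small2": b"x",
}

groups = group_by_digest(sample_files)
# Only digests shared by more than one file indicate duplicates.
duplicates = {d: paths for d, paths in groups.items() if len(paths) > 1}
print(sorted(sorted(paths) for paths in duplicates.values()))
```

In this sketch the two small files are identical but never reach the dictionary, which mirrors why the debugger output above contains fewer files than the rpm query.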

And yes, I found these hash values using the Python debugger (Pdb). At runtime they are stored in a variable named md5s in the DuplicatesCheck.py file.

I believe a mock-based test would work, provided that we implement the md5 variable within the PkgFile class and pass the header information when creating a mock package with FakePkg. I am not sure whether I should hard-code these hash values into the header object that is passed to the test function. I even tried creating real files using the real_files=True argument, but that didn’t work.
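
One way to picture the idea is sketched below, purely under assumptions: if each mock file object carried its own digest, a test could supply digests directly instead of hashing real content. MockFile, its md5 field, and find_duplicates are hypothetical names invented for this example; they are not rpmlint’s actual PkgFile or FakePkg API, and the digest strings are placeholders.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MockFile:
    """Hypothetical stand-in for a packaged file entry."""
    name: str
    md5: str  # placeholder digest supplied by the test, not computed
    size: int = 0


def find_duplicates(files):
    """Return sorted lists of file names that share the same digest."""
    by_digest = defaultdict(list)
    for f in files:
        by_digest[f.md5].append(f.name)
    return [sorted(names) for names in by_digest.values() if len(names) > 1]


# Placeholder digests: two files deliberately share the value "aaaa".
files = [
    MockFile("/etc/strace1.txt", md5="aaaa", size=100),
    MockFile("/etc/strace2.txt", md5="aaaa", size=100),
    MockFile("/etc/foo", md5="bbbb", size=100),
]
print(find_duplicates(files))
```

Whether such digests should be hard-coded in the test or derived from generated content is exactly the open question here; the sketch only shows that the check itself needs nothing more than a digest per file.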

Misc.

In addition to working on DuplicatesCheck.py, I have also made some progress with FilesCheck.py. This also requires some more capabilities of FakePkg. However, I haven’t explored all the possible ways to create files and test them yet. I plan to do that in the coming week.

As mentioned in my last post, I will be visiting the SUSE office in Bangalore. I am planning to visit around the third week of August and will share the exact date. I will also share the details on my LinkedIn page, so do follow me on LinkedIn.

